There’s been much ado about the pros and cons of artificial intelligence over the last few months since the start of the ChatGPT era. Generative AI has hit the mainstream; new vendor solutions are cropping up daily and professionals from many different industries are giving it a test drive. Most of the data fueling these new tools is unstructured; success with generative AI requires a comprehensive unstructured data governance strategy. In this two-part blog series, we’ll cover the expanding field of AI data governance. I’ll explain the data risks from generative AI and the role of unstructured data management in mitigating risks.
With proper guardrails and the right tools, organizations can safely take advantage of these new AI solutions for a variety of use cases.
For a recent New York Times article, doctors from different specialties described their experiences using ChatGPT to communicate more compassionately with their patients. One doctor used ChatGPT to write a letter to an insurer that had declined to pay for the off-label use of an expensive medication. After receiving the bot’s letter, the insurer granted the request. We are hearing of many other promising uses of generative AI, from marketing to operations and R&D.
“In short, anything that people do with their natural intelligence today can be done much better with AI, and we will be able to take on new challenges that have been impossible to tackle without AI, from curing all diseases to achieving interstellar travel,” wrote Marc Andreessen.
The Need for AI Data Governance
Yet this is not the whole truth. There are real dangers with AI, and generative AI has brought them sharply to the forefront. Executives are rightly worried about the unintended outcomes of this new technology. Evidence of employees leaking corporate data into ChatGPT abounds. People worry that AI is going to kill their careers, steal their identities, rob them of their financial assets, and worse: doomsday predictions are everywhere. The reality is likely somewhere in the middle.
Applications like ChatGPT seem intelligent and creative in a humanlike way: they generate new content using pattern matching and are built on large language models (LLMs) that have been pretrained on large data sets. The dangers lie within these LLMs, because there are many risks and unknowns with the data.
Companies and organizations need to understand the data management issues that relate to generative AI. Let’s look at five key areas for AI data governance to consider: security, privacy, lineage, ownership and governance of unstructured data for AI, or SPLOG.
SPLOG and Unstructured Data Management with Generative AI
Security: Data confidentiality and security are at risk with third-party generative AI applications because, once you feed your data into a tool, it may be used to train the LLM and could surface in responses to other users. Get clear on the vendor’s legal agreements as they pertain to your data. There are new ways to manage this: ChatGPT now allows users to disable chat history so that chats won’t be used to train its models, although OpenAI retains the data for 30 days. One way to protect your organization is to segregate sensitive and proprietary data into a private, secure domain that restricts sharing with commercial applications. You can also maintain an audit trail of the corporate data that has been fed to AI applications.
Privacy: When you create a prompt for an AI tool to produce an output based on your query, you don’t know if the result will include protected data, such as PII, from another organization. Your company may be liable if you use the tool’s output externally in content or a product and the PII is discoverable. In addition, since non-AI vendors are now incorporating AI tools into their solutions, perhaps even without their customers’ knowledge, the risk compounds. Your commercial backup solution could incorporate a pretrained model to find anomalies in your data, and that model may contain PII; this could indirectly put you at risk of a violation. Data provenance and transparency around the training data used in an AI application are critical to ensuring privacy.
Lineage: Today there is not much transparency about the data sources behind generative AI applications. These applications may draw on biased, libelous or unverified sources. This makes GenAI tools a questionable choice when you need results that are factually accurate and objective. Consider the problem you are trying to solve with AI before choosing a tool. Machine learning systems are better for tasks that require a deterministic outcome.
Ownership: The data ownership piece of generative AI concerns what happens when you derive a work: who owns the IP? As it stands today, copyright law dictates that “works created solely by artificial intelligence — even if produced from a text prompt written by a human — are not protected by copyright,” according to reporting by BuiltIn. In addition, the article continues, the use of copyrighted materials in training AI models is permitted under the fair use doctrine. A batch of lawsuits currently under consideration, however, challenges this interpretation. It will be increasingly important for organizations to track who commissioned derivative works and how those works are used internally and externally.
Governance: If you work in a regulated industry, you’ll need to show an audit trail of any data used in an AI tool and demonstrate that your organization is complying with the applicable regulations. A healthcare organization, for instance, would need to verify that no patient PII has been leaked to an AI solution per HIPAA rules. This requires a governance framework for AI that covers privacy, data protection, ethics and more. Data management solutions help by providing a means to monitor data usage in AI tools and create a foundation for unstructured data governance.
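To make the Security and Governance points more concrete, here is a minimal Python sketch of a gate that sits in front of an external AI tool. It blocks text that carries a restricted classification label or matches a simple PII pattern, and it records every decision in an audit trail. Everything in it is an assumption for illustration: the label names, the two regular expressions, the `request_share` function and the `example-chatbot` tool name are hypothetical and do not come from any specific product. Two regexes are nowhere near enough for HIPAA-grade detection; treat this as a starting point for discussion, not a compliance control.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical classification labels that must stay in the private domain.
RESTRICTED_LABELS = {"confidential", "proprietary"}


@dataclass
class AuditRecord:
    timestamp: str
    user: str
    tool: str
    source: str
    sha256: str      # fingerprint of the text, so the log never stores the text itself
    allowed: bool
    reasons: list    # why the request was blocked (empty when allowed)


def find_pii(text):
    """Return the names of the PII patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def request_share(text, source, labels, user, tool, audit_log):
    """Decide whether text may go to an external AI tool; log the decision either way."""
    reasons = [f"label:{label}" for label in sorted(set(labels) & RESTRICTED_LABELS)]
    reasons += [f"pii:{name}" for name in find_pii(text)]
    allowed = not reasons
    audit_log.append(AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        user=user,
        tool=tool,
        source=source,
        sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        allowed=allowed,
        reasons=reasons,
    ))
    return allowed


audit_log = []
ok = request_share(
    "Summarize our public launch announcement.",
    "press/launch.txt", ["public"], "alice", "example-chatbot", audit_log,
)
blocked = request_share(
    "Patient jane@example.com, SSN 123-45-6789",
    "records/intake.txt", ["confidential"], "bob", "example-chatbot", audit_log,
)

for record in audit_log:
    print(record.source, "allowed" if record.allowed else "blocked", record.reasons)
```

Logging blocked requests as well as allowed ones is deliberate: an auditor usually wants to see what was stopped, not only what went through. Storing a hash rather than the text keeps the audit trail from becoming a second copy of the sensitive data.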
In the next blog of this two-part series, I will describe two different pathways for using generative AI and how to adapt your unstructured data management and data governance practices accordingly:
- Curate Audit and Move (CAM), which manages the feeding of corporate or domain-specific data to a pretrained large language model for the best adaptation; or
- Use a pretrained LLM with prompt-based augmentation that you feed data to and manage across the SPLOG principles outlined above.