This is part 2 of a two-part series on AI and unstructured data management. Read part 1 here.
In many sectors, the latest generation of AI tools is creating excitement about its potential to change and improve many facets of work. According to new research by McKinsey, generative AI could deliver value equal to an additional $200 billion to $340 billion annually in banking, and $400 billion to $660 billion a year in retail and consumer packaged goods. Work will become more efficient and less mundane. Innovations will get to market faster and help solve real societal problems.
Yet as with most transformative technologies, there are downsides. In the case of generative AI, these range from the leakage of sensitive, private and proprietary data, to the rampant spread of false or biased information, to the production of faulty products that harm others, to more sinister outcomes still: imagine an AI bot that can manipulate data to start a war or a deadly pandemic. AI industry leaders from companies including OpenAI and Google DeepMind have warned that AI could one day kill us all.
Existential threats are extreme, but the risks of unmanaged AI have already begun to appear: Samsung, for one, restricted generative AI tools after employees pasted sensitive internal source code into ChatGPT. In the previous blog, I reviewed five key areas of AI data governance to consider when using generative AI solutions.
We call this SPLOG, for security, privacy, lineage, ownership and governance of unstructured data. It’s crucial to understand these risks and create a plan for managing each area before you implement a generative AI solution in your organization.
Next, how do you go about safely and efficiently using these new tools for competitive advantage? Today, we see two core approaches:
- Customize an LLM with corporate data using Curate, Audit and Move (CAM): a custom approach that manages the feeding of corporate or domain-specific unstructured data into a pretrained large language model (LLM);
- Prompt a pretrained LLM with corporate data using SPLOG: use a pretrained LLM with prompt-based augmentation, feeding it your data while managing that data across the SPLOG principles.
CAM: Curate, Audit and Move
The Curate, Audit and Move (CAM) approach entails creating a custom large language model (LLM), which affords enterprises the ultimate control over their data and its protection while mitigating the risks of using public data sets. This involves selecting a third-party pretrained model, such as GPT-4, BERT, T5 or RoBERTa, and fine-tuning it with your own data to create a proprietary LLM.
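As a rough illustration, fine-tuning a pretrained model on a curated corpus might look like the sketch below, which uses the Hugging Face Transformers library; the model name, file path and hyperparameters are placeholder assumptions, not recommendations.

```python
# Minimal fine-tuning sketch using Hugging Face Transformers.
# Assumes a curated, de-identified text corpus in ./curated_corpus.txt;
# the model choice and hyperparameters are illustrative placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # stand-in for whichever pretrained model you license
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load and tokenize the curated corpus.
dataset = load_dataset("text", data_files={"train": "curated_corpus.txt"})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./custom-llm", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./custom-llm")  # the proprietary model stays in-house
```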
Building a custom LLM is a complex, resource-intensive task requiring specialized data science expertise and a robust computing infrastructure. The AI computing stack typically consists of high computing capacity (CPUs and GPUs), efficient flash storage, and appropriate security systems to protect any sensitive IP data used in the LLM.
Your team will also need to develop an unstructured data management workflow to identify, copy and move the right data to your LLM; provide an audit record of this so data scientists can later review it to investigate any issues or errors in the outcomes; and then delete or archive the data from high-performance storage upon project completion.
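A minimal sketch of that curate-audit-move flow appears below; the directories, the selection rule and the cleanup step are hypothetical stand-ins for whatever your governance policy dictates.

```python
# Sketch of a Curate-Audit-Move (CAM) workflow for unstructured data.
# All paths and the selection predicate are hypothetical examples.
import hashlib
import json
import shutil
import time
from pathlib import Path

SOURCE = Path("/data/corporate_share")      # where unstructured data lives
STAGING = Path("/fast-storage/llm_intake")  # high-performance training tier
AUDIT_LOG = Path("/audit/cam_audit.jsonl")  # append-only audit record

def is_approved(path: Path) -> bool:
    """Curate: only sanctioned file types, nothing flagged confidential."""
    return path.suffix in {".txt", ".md"} and "confidential" not in path.name.lower()

def cam_move():
    STAGING.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a") as log:
        for src in SOURCE.rglob("*"):
            if not (src.is_file() and is_approved(src)):
                continue
            dest = STAGING / src.name
            shutil.copy2(src, dest)  # Move: copy to the training tier
            # Audit: record what moved, when, and a content fingerprint.
            log.write(json.dumps({
                "file": str(src),
                "sha256": hashlib.sha256(src.read_bytes()).hexdigest(),
                "moved_to": str(dest),
                "timestamp": time.time(),
            }) + "\n")

def cleanup():
    """Upon project completion, clear the high-performance copy."""
    shutil.rmtree(STAGING)

cam_move()
```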
Due to the cost and time required to create and manage your own custom LLM, cloud providers are developing platforms to ease the process. Two examples are Azure OpenAI and Amazon SageMaker.
Adapt a Pretrained Third-Party LLM
Most IT organizations will use a pretrained model through a SaaS application (such as ChatGPT) with their own data. This approach doesn’t require building an internal computing platform or staffing a team of data scientists to run it.
As covered in my earlier blog about the SPLOG process, this approach requires a heavy lift on data governance and data management. It’s critical to mitigate the risks of proprietary data leakage and security and privacy issues, while also navigating data ownership, transparency and data lineage factors.
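To make the prompt-based approach concrete, here is a minimal sketch that redacts obvious identifiers before sending an internal document as context to a pretrained LLM. It assumes the OpenAI Python SDK (v1+) with an API key in the environment; the regex filters are illustrative only, and a production system would need far stronger classification.

```python
# Sketch: prompt a pretrained LLM with corporate data, redacting obvious
# identifiers first. Assumes the OpenAI Python SDK (v1+) and an API key in
# the OPENAI_API_KEY environment variable; the regexes are illustrative only.
import re
from openai import OpenAI

REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN pattern
]

def redact(text: str) -> str:
    """Strip obvious identifiers before any data leaves your environment."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

def ask_with_context(question: str, internal_doc: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; use whichever model you have sanctioned
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{redact(internal_doc)}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(ask_with_context("Summarize our Q3 findings.", open("q3_report.txt").read()))
```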
A Data Management Framework for AI
IT leaders, in concert with security, legal and data science experts, should develop a data management framework for AI. Here are some top considerations for such a framework and its associated guidelines:
- Create employee guidelines for sending data to AI systems. Which data is sanctioned and for what kinds of research and use cases? Which data sets are off limits and secured so that individuals cannot access them to feed AI tools?
- What documentation and assurances can you obtain from AI vendors for handling of your data?
- What tools does the vendor offer to help mitigate data risk and have they been tested well enough for broad use? For instance, ChatGPT now allows users to disable chat history so that chats won’t be used to train its models.
- Can you segregate sensitive and proprietary data into a private, secure domain which restricts sharing with commercial applications?
- Maintain an audit trail of all corporate data that has fed AI applications (see the logging sketch after this list).
- Track who commissioned derivative works from generative AI tools and how those works are used internally and externally, to protect against any lawsuits for copyright infringement.
- What additional tools and capabilities are needed to protect, manage and monitor unstructured data in AI applications?
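As a starting point for that audit trail, the sketch below appends one record per data submission to an append-only log; the field names and log location are assumptions to adapt to your own policy.

```python
# Sketch of an append-only audit trail for data sent to AI applications.
# Field names and the log path are illustrative assumptions.
import getpass
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("ai_data_audit.jsonl")

def record_submission(file_path: str, ai_system: str, purpose: str) -> None:
    """Log who sent which data to which AI system, and why."""
    payload = Path(file_path).read_bytes()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        "file": file_path,
        "sha256": hashlib.sha256(payload).hexdigest(),  # content fingerprint
        "ai_system": ai_system,
        "purpose": purpose,
    }
    with AUDIT_LOG.open("a") as log:
        log.write(json.dumps(entry) + "\n")

record_submission("roadmap_notes.txt", "ChatGPT", "summarization experiment")
```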
Moving forward
In these early days of generative AI, it’s best to proceed cautiously with projects. It will be months before industry standards and regulations catch up.
Start with an assessment of your data assets and an understanding of any potential liability issues as they pertain to your data’s inclusion in an AI prompt. Spend time researching an AI tool’s vulnerabilities, limitations and protections before implementing it. Identify the needs and top use cases for AI, since the goals will determine the best possible solution. Not all projects are suited for generative AI, which is designed to create new content rather than perform predictive analysis.
Keep an eye out for new software tools that can filter outcomes for objectionable or inaccurate data sources, monitor the security and privacy risks to your data to avoid leakage or privacy violations, or offer private sandbox environments for experimentation. An unstructured data management solution can also help track whether and how employees are using internal data in AI systems, and can provide holistic visibility into data assets and where they are stored.