GUIDE TO UNSTRUCTURED DATA PREPARATION FOR AI
“Building an effective data management value chain can lead to powerful and game-changing benefits. Forward-looking data-driven companies are bringing in a product mindset, managing the data like a product across its entire life cycle.” – Deloitte
Overview: Is your data prepared for AI?
CIOs and other IT leaders are embroiled in the most disruptive wave of technological change of their careers as AI continues its unstoppable impact on daily work, life and society at large. The days of thinking that AI might settle out and wind down as just the latest overhyped trend are over.
There is quite a lot to consider: from building out the proper hybrid IT infrastructure, to reskilling IT staff, training employees, selecting the best tools and determining viable use cases for generative AI and AI agents. At the heart of AI, of course, is the data. Most of today’s data is unstructured data: user files, chats and texts, images, video, sensor data, instrument data, and much more.
In this guide, we delve into the data challenges and requirements of deploying AI in the enterprise. For AI initiatives to scale and avoid negative outcomes, IT must lead with systematic processes to classify, govern, and manage unstructured data efficiently and securely. Unlike structured tabular data, unstructured data spans countless formats and sources, often with duplicate files, unknown sensitivity, and unclear ownership—making it complex to govern and prepare for AI.
Blog: Is Your Data Ready for AI Inferencing?
Understanding the risks and challenges of unstructured data for AI
GenAI, for all its transformative qualities in the workforce, has become a massive headache for CIOs. The security, liability and credibility risks of inappropriate, ungoverned use of these tools are no joke.
In the Komprise IT Survey: AI, Data & Enterprise Risk, IT leaders reported “extreme worry” about shadow AI, and 80% say their organizations have experienced negative outcomes from generative AI. A seemingly innocuous task such as summarizing meeting notes in an AI tool could inadvertently expose sensitive customer and proprietary data to public LLMs. Avoiding sensitive data leakage to AI and protecting intellectual property and PII is the CIO’s top priority.
Preventing sensitive data from being ingested into AI pipelines is one issue. Culling only the right data for AI is equally imperative, because AI output is only as good as the data you feed it for a particular goal. While it may seem simpler to send large volumes of data into a data lakehouse to filter and process later, this creates exorbitant storage and compute costs as well as complexity.
Organizations are storing petabytes of data and billions of files, much of it rogue or irrelevant. Further, you may have duplicate and near-duplicate copies of files created over the years, adding to the cost and the sorting-out effort. Sending too much data, or the wrong data, to AI destinations won’t deliver the results that data stakeholders want, either.
Watch the video: Komprise Data on the Move: Agentic AI and Unstructured Data
The truth is, most enterprise AI pilots don’t make it to production. Gartner estimates up to 60% will fail, often due to inadequate data readiness. IT leaders need to focus on addressing the following issues to prepare their unstructured data for AI.
What are the top barriers for AI data preparation?
1) Too many data silos with no central visibility or insights. Given that most organizations store data across multiple vendor systems, from on-premises to the cloud, it is difficult to understand, locate and access all the data needed for AI training and inference. This fragmentation can lead to incomplete or biased datasets. Silos may also result in the same data being copied and stored multiple times across different systems, which increases storage costs and adds confusion about which dataset is the “single source of truth” for AI usage. Preparing unstructured data for AI requires efficient tagging and movement to compute-ready platforms, and silos make it harder to automate or scale these processes across the enterprise. Understanding your file and object data is a foundational first step.
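As a rough illustration, gaining visibility into a single silo can start with a metadata index that walks the file tree and aggregates counts, sizes and access age per file type. This is a minimal sketch; the one-year “cold” threshold and the per-extension grouping are illustrative assumptions, not a reference to any specific product.

```python
# Minimal sketch: index one storage silo's file metadata to get
# visibility into counts, bytes and cold data per file type.
import os
import time
from collections import defaultdict

def index_silo(root):
    """Walk one silo and aggregate basic file metadata by extension."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    cold_cutoff = time.time() - 365 * 24 * 3600  # untouched for a year = "cold"
    cold_files = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable files rather than abort the scan
            ext = os.path.splitext(name)[1].lower() or "<none>"
            stats[ext]["count"] += 1
            stats[ext]["bytes"] += st.st_size
            if st.st_atime < cold_cutoff:
                cold_files += 1
    return stats, cold_files
```

A real deployment would run this per silo and merge the results into one view, which is the “unified visibility” idea discussed later in this guide.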
What can Komprise Analysis do for you?
2) Lack of unstructured data classification.
System-generated metadata for unstructured data is too basic to be useful when searching for and curating precise data sets for analytics projects. To make this data useful, it needs additional structure and context to support rapid, precise data curation. Departmental users need easier ways to find the data they need, eliminating the need to dump large, irrelevant data sets into AI to process and filter, which adds unnecessary storage and compute costs and time. Yet classifying unstructured data by enriching metadata is often a manual, incomplete process that doesn’t scale. A large percentage of an organization’s data estate is therefore neither discoverable nor available for AI, eroding competitive advantage.
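To make the enrichment idea concrete, here is a minimal sketch of a tag-based metadata index: rules or users attach tags to files, and curation becomes a query over tags instead of a bulk dump into AI. The extensions, folder name and tags below are hypothetical examples.

```python
# Minimal sketch of metadata enrichment: attach tags to files in a
# lightweight in-memory index so data sets can be curated by query.
import os

def enrich(index, path, tags):
    """Add tags to a file's entry in the metadata index."""
    entry = index.setdefault(path, {"tags": set()})
    entry["tags"].update(tags)

def rule_based_tags(path):
    """Derive simple tags from system metadata; real rules would be richer."""
    tags = set()
    ext = os.path.splitext(path)[1].lower()
    if ext in {".jpg", ".png", ".dcm"}:
        tags.add("image")
    if "project_x" in path.lower():  # hypothetical project folder convention
        tags.add("project-x")
    return tags

def query(index, tag):
    """Return all files carrying a given tag - the curation step."""
    return [p for p, e in index.items() if tag in e["tags"]]
```

User-driven tags (for example, a researcher adding `"meeting"` or a diagnostic code) go through the same `enrich` call, so rule-based and human tagging coexist in one index.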
Cracking the Code for Unstructured Data Classification
3) Incomplete AI data governance.
IT organizations need new policies and technologies for AI data governance. AI has introduced an entirely new set of risks and liabilities to organizations. AI is innovating quickly and IT leaders are struggling to keep up with the latest requirements to keep data safe and to avoid negative outcomes from AI projects. AI data governance is the framework, policies, and procedures organizations put in place to ensure that data used in AI systems is managed and used in a responsible, ethical, and compliant manner. Comprehensive AI data governance programs and tools cover:
- Sensitive data detection to avoid IP, PII and other private data leakage into commercial models;
- Provenance and transparency of training data;
- Data labeling or tagging for accuracy and consistency;
- Bias detection and mitigation in datasets;
- Auditing of AI model inputs and outputs;
- Human verification of AI derived works and decisions.
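The auditing item above can be sketched as an append-only log of model inputs and outputs, so questionable AI outcomes can be traced back to their inputs. The record fields below are assumptions for illustration, not the schema of any particular governance tool.

```python
# Minimal sketch of auditing AI model inputs and outputs: record each
# prompt/response pair with a timestamp, source files and content hashes.
import hashlib
import json
import time

def audit_record(log, source_files, prompt, response):
    """Append one auditable entry; `log` is any list-like sink."""
    entry = {
        "ts": time.time(),
        "sources": sorted(source_files),
        # Hashes identify exact content without storing it in the log.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
    }
    log.append(json.dumps(entry))
    return entry
```

Hashing rather than storing the raw prompt keeps sensitive content out of the audit trail itself while still letting you prove what went in and came out.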
4) Achieving high unstructured data quality for AI is elusive.
With structured and semi-structured data, a common practice has been to send files en masse to a data warehouse, data lake or data lakehouse, where data engineers, data scientists and analysts can access it again and again for different projects. This model does not work for unstructured data, which is much larger, more expensive to store and harder to move. A data lake with petabytes of unstructured data (billions of files and objects) becomes an unwieldy data swamp that is hard to search. You’re also copying a healthy percentage of junk that will deliver poor and even dangerous results from AI, especially if it’s not filtered and classified before someone feeds it to an AI prompt.
Achieving data quality for AI demands a different approach. Your users need simpler ways to search and cull the right data in place, before moving any data to a data lakehouse or AI engine.
TechVoices: Komprise’s Krishna Subramanian: AI and Data Management
5) Slow, difficult, costly process for feeding data to AI pipelines.
As noted above, copying petabytes of data into other platforms and tools for AI is financially untenable, since it requires ample high-performance storage and AI compute. Even in the cloud, you could see annual IT infrastructure costs double or even triple from AI. The iterative nature of AI workflows means that IT will need to move data to different processors repeatedly, multiplying costs, especially if the data is retained after processing is complete. Copying millions of files into an AI prompt engine is neither efficient nor practical, given the processing time.
Duquesne University Finds and Tags Digital Images 99% Faster
5 tactics to manage and prepare unstructured data for AI
As organizations ramp up their use of AI, IT infrastructure teams are playing a greater role in preparing data for smarter, safer use. This means gaining clear visibility into file and object data across all systems, tagging and organizing it for AI workflows, and making sure sensitive information is not jeopardized.
The old ways of moving and preparing data don’t work well for unstructured data nor for AI’s complex needs. To succeed, teams need modern tools to classify, manage, and move only the data that matters—saving money, improving outcomes, and lowering security and privacy risks.
Check our blog channel on AI-Ready Data.
- Get unified visibility across data silos: Enterprise storage and backup tools have some data management insights and actions, but only for the data they maintain. Independent unstructured data management solutions can work across all your silos, index metadata to deliver insights on data growth, file types and sizes, and user access trends, and move data wherever you wish without lock-in. This saves money and time and ensures that you are managing data appropriately for its use case and value in the moment. You can integrate storage-agnostic unstructured data management with any desired tools for additional analytics or specialized functions such as metadata enrichment and PII protection. Your organization owns the data; you should be able to manage it as needed without vendor restrictions. Komprise is built on a global metadatabase, which can be the hub for all of your unstructured data management actions, including preparing data for AI via metadata enrichment and automated Smart Data Workflows.
- Adopt the appropriate data preparation modality for AI: The traditional extract, transform, load (ETL) model falls short for unstructured data used in AI because AI workflows are iterative, multistage, and nonlinear. AI also introduces critical data governance challenges, such as integrating human verification, that traditional ETL processes were never designed for. Instead, AI requires metadata indexing, user-driven data tagging, and built-in governance with sensitive data detection and lineage tracking. A global metadatabase that indexes data with metadata tagging across all storage environments supports intelligent data curation, so you move only relevant content, such as pinpointing 10,000 images out of 3 million documents. Komprise Smart Data Workflows deliver an easy UI to discover, enrich and classify data, confine sensitive data, move the right data to AI and even integrate third-party processors for specialized actions such as image identification. Preparing and moving diverse data types to AI is a complex process that demands new methods of data management. Read: Preparing unstructured data for AI? Forget ETL.
- Power AI with the right data at the right time: With full visibility, analysis and a system to query across all data, your departments can create repeatable, curated unstructured data pipelines to AI. Use an unstructured data management solution that supports user-based tagging, such as clinical researchers tagging files by demographics and diagnostic codes. AI-based content indexing tools can inspect files and tag them rapidly and accurately. By bringing specificity to AI data workflows, employees send only the right files to AI, speeding up projects, improving the accuracy of outcomes and helping IT avoid expensive, unusable data swamps. Furthermore, with enhanced classification, IT can deliver better data services to users. Data classified as cold, zombie or duplicate can be tiered to low-cost storage or deleted altogether, saving money and reducing the active attack surface for ransomware attacks.
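A minimal sketch of how duplicate and cold files might be flagged, assuming content hashing for duplicate detection and last-access time for coldness; the one-year threshold is an illustrative policy choice.

```python
# Minimal sketch: flag duplicate files by content hash and cold files
# by last-access time, as candidates for tiering or deletion.
import hashlib
import os
import time
from collections import defaultdict

def find_duplicates(paths):
    """Group files by SHA-256 of their content; groups > 1 are duplicates."""
    by_hash = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                h.update(chunk)
        by_hash[h.hexdigest()].append(path)
    return [group for group in by_hash.values() if len(group) > 1]

def find_cold(paths, days=365):
    """Files not accessed within `days` are candidates for cheap tiers."""
    cutoff = time.time() - days * 86400
    return [p for p in paths if os.stat(p).st_atime < cutoff]
```

In practice you would compare file sizes first and hash only size-matched candidates, since hashing billions of files end to end is itself expensive.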
- Deliver trusted data for AI: Storage IT professionals need to ensure that sensitive data, such as PII and IP, is unavailable for user search and ingestion by AI pipelines. Komprise Smart Data Workflows delivers both standard PII detection and custom (regex and keyword) sensitive data detection. After detection, the solution automatically tags the data in the metadatabase (global file index) and IT can set policies to confine or move it to a safe location. You can set up automated workflows to identify and exclude sensitive data from what is searchable and available for AI ingestion. Finally, Komprise maintains a full audit record of all data processed by any workflow, so you can investigate issues or concerns arising from AI outcomes. Read more about sensitive data management.
- Revisit skills/staff requirements for AI: Storage IT professionals are increasingly managing data movement and access across complex hybrid cloud and multi-vendor environments while addressing security threats from AI and cyberattacks. They need new tools and tactics to manage infrastructure and govern data workflows for AI. Key strategies include:
- Establish processes for departmental collaboration on AI and analytics initiatives to understand new requirements.
- Track metrics such as data volume, growth, cold and hot data, and data access trends.
- Use FinOps capabilities in unstructured data management to optimize storage and move cold data to cost-effective tiers.
- Mitigate ransomware risks to corporate data by using immutable cloud storage for inactive data.
- Deliver AI-ready storage and compute resources (CPUs, GPUs, TPUs) to support model training and deployment.
- Prepare data for analytics and AI with automated workflows and data classification techniques, and deliver rapid search and tagging capabilities for department managers.
- Protect sensitive data from leaks by segregating private data, implementing audit trails, and establishing governance frameworks.
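The regex-and-keyword style of sensitive data detection described above can be sketched as follows. The patterns are illustrative (US-style SSNs, email addresses, credit-card-like numbers); a production scanner would need far more patterns, validation and context awareness.

```python
# Minimal sketch of regex-and-keyword sensitive data detection.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # loose card-number shape
}

def detect_sensitive(text, keywords=()):
    """Return the set of sensitive-data labels found in `text`."""
    hits = {label for label, rx in PATTERNS.items() if rx.search(text)}
    lowered = text.lower()
    # Custom keywords cover org-specific terms, e.g. secret project names.
    hits.update(f"keyword:{kw}" for kw in keywords if kw.lower() in lowered)
    return hits
```

Files whose contents return a non-empty hit set would then be tagged in the metadata index and excluded from AI-facing search and ingestion, per the workflow described above.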
Learn more about Smart Data Workflows and AI-Ready Data from Komprise.
What is unstructured data management?
What is AI data preparation?
What is AI data management?