Unstructured Data Management In the Age of Generative AI
Guidance for the Next Generation of Unstructured Data Management Challenges
Although rules and standards are still in development, it’s possible to establish basic data management principles when working with the unstructured data that feeds AI tools.
The paper reviews:
- AI data management principles included in the security, privacy, lineage, ownership, governance (SPLOG) framework;
- How to protect and segment data in generative AI;
- How to track and audit data in generative AI;
- The importance of guardrails for employees;
- The role of unstructured data management in AI.
Generative AI technology, and the laws and industry standards relating to it, are still evolving. Yet it’s clear that the ability to define and enforce basic data governance standards will be paramount for taking full advantage of AI solutions without assuming unnecessary risks. Read this paper to review data management strategies for successful GenAI initiatives.
Why has generative AI made unstructured data management a strategic priority rather than an infrastructure concern?
When the Komprise white paper on unstructured data management in the generative AI era was first published, the connection between generative AI and unstructured data still had to be argued. It no longer does. The top business challenge for unstructured data management is now reducing data risk from AI, cited by 62% of IT leaders, and the top challenge in preparing data for AI is classifying and tagging, cited by 56%. The shift from infrastructure concern to strategic priority has happened for five compounding reasons:
- AI runs on unstructured data — every generative AI use case from building a chatbot on internal knowledge to deploying AI agents for customer support requires unstructured data as its input; customer call recordings, medical images, and sensor data from self-driving cars hold the key to making AI valuable at enterprise scale, and none of this data lives in structured databases
- The stakes are financial, not just technical — nearly 80% of IT leaders say their organization has experienced negative outcomes from using corporate data with AI, with 13% reporting financial, customer, or reputational damage; the question is no longer whether to govern unstructured data for AI but how urgently
- Data volumes have crossed a threshold — 74% of organizations are now storing more than 5PB of unstructured data, a 57% increase over 2024; at this scale, manual curation is not viable and the risks of ungoverned AI access are compounded by the sheer volume of sensitive content across the estate
- Governance is the foundational requirement — the ability to define and enforce basic data governance standards will be paramount for taking full advantage of AI solutions without assuming unnecessary risks; the white paper’s governance framework remains the correct foundation, but the tools and urgency to implement it have accelerated dramatically
- Storage costs compound the urgency — with flash and NAND prices projected to rise 130% by the end of 2026 according to Gartner, the same ungoverned data estates that create AI risk are also driving unsustainable storage costs; addressing both simultaneously through intelligent data management is now a board-level conversation, not just an IT infrastructure decision
What is the Curate, Audit, Move framework for AI data preparation and why does it still define best practice in 2026?
The Curate, Audit, Move (CAM) framework introduced in the Komprise generative AI white paper describes the three essential steps for preparing unstructured data for any AI use case: curating the right data from across the enterprise, auditing it for quality, sensitivity, and compliance, and moving only the right subset to the AI pipeline. CAM remains the correct architecture; what has changed is the tooling available to execute it at petabyte scale. CAM maps to current Komprise capabilities as follows:
- Curate (finding the right data across every silo) — the Komprise Global Metadatabase continuously indexes all unstructured data across NAS, cloud, and object storage, capturing standard and enriched metadata including file type, age, owner, project code, and classification tags; the Global Metadatabase delivers a surgical approach with rich filters to find just the data needed, unlike traditional ETL and data ingestion approaches that blindly copy data from a source
- Audit (governing sensitive data before AI ingestion) — Komprise Sensitive Data Management scans for PII, PHI, and IP using built-in pattern scanners, custom regex, and KAPPA Data Services that extract domain-specific sensitive content from proprietary file formats; using Komprise Smart Data Workflows, organizations can automate the curation, search, tagging, and movement of data to the right locations for use in data lakes and AI tools
- Move (delivering curated data efficiently) — Komprise Intelligent AI Ingest delivers the curated, governed dataset to any AI stack 2x faster than standard transfer tools, filtering out 70%+ of data noise before it reaches the AI pipeline; this directly addresses the greatest challenge in preparing unstructured data for AI: finding and moving the right data to locations for AI ingestion
- The framework now applies to agentic AI — as AI agents increasingly invoke data retrieval autonomously at runtime, the CAM framework must operate continuously, not just at project setup; Komprise KAPPA Data Services can be invoked directly by AI agents at runtime, enabling just-in-time curation on demand
- CAM also addresses shadow AI — by ensuring that only curated, audited, and governed data reaches AI pipelines, CAM creates the conditions under which shadow AI incidents become detectable and preventable; data that has been classified and tagged cannot silently flow into unauthorized AI tools
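The three CAM steps above can be pictured as a small filter pipeline. The following is a minimal Python sketch, not the Komprise implementation: the `FileRecord` fields, the two regex patterns, and the function names `curate`, `audit`, and `move` are all hypothetical, chosen only to illustrate how metadata filtering, sensitive-data tagging, and selective delivery fit together.

```python
import re
from dataclasses import dataclass, field

# Hypothetical file record; the field names are illustrative, not a product schema.
@dataclass
class FileRecord:
    path: str
    file_type: str
    age_days: int
    text: str = ""
    tags: set = field(default_factory=set)

# Two deliberately simple patterns; real sensitive-data scanning needs far broader coverage.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def curate(records, file_types, max_age_days):
    """Curate: keep only records that match the project's metadata filters."""
    return [r for r in records
            if r.file_type in file_types and r.age_days <= max_age_days]

def audit(records):
    """Audit: tag each record with the sensitive-data categories found in its text."""
    for r in records:
        for label, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(r.text):
                r.tags.add(label)
    return records

def move(records):
    """Move: return the paths of records that carry no sensitive tags."""
    return [r.path for r in records if not r.tags]

records = [
    FileRecord("/proj/a.txt", "txt", 30, "Quarterly summary, no personal data."),
    FileRecord("/proj/b.txt", "txt", 45, "Contact: jane@example.com"),
    FileRecord("/proj/c.log", "log", 10, "debug output"),
    FileRecord("/proj/d.txt", "txt", 900, "old notes"),
]
to_ingest = move(audit(curate(records, {"txt"}, 365)))
print(to_ingest)  # prints ['/proj/a.txt']
```

In this sketch, curation runs before auditing so that content scanning is spent only on files that already match the project's filters; only the one file that is in scope and free of sensitive tags reaches the ingest list.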
What are the core AI data governance principles enterprise IT teams must implement, and how has the regulatory environment evolved since the white paper was published?
The white paper identified security, privacy, lineage, ownership, and governance as the five foundational principles for AI data management. It argued that, even while the technology and the laws and industry standards relating to it were still evolving, the ability to define and enforce basic data governance standards would be paramount for taking full advantage of AI solutions without assuming unnecessary risks. In 2026, the regulatory landscape has clarified considerably:
- Security is now the top enterprise concern — the greatest data concern for generative AI is security, specifically corporate data leakage, cited by 46% of IT leaders; preventing sensitive data from reaching unauthorized AI tools has moved from a compliance nicety to an operational priority enforced by incident history
- Data lineage is an audit requirement — knowing what data was used to train or inform an AI model, where it came from, and who authorized its use is now a prerequisite for compliance with emerging AI regulations in the EU, HIPAA enforcement guidance on AI, and SEC requirements for financial services AI deployments
- Shadow AI has made ownership critical — employees feeding sensitive data into tools like ChatGPT, GitHub Copilot, or private AI apps creates ownership ambiguity that is difficult to resolve after the fact; the white paper’s principle that data ownership must be established before AI deployment has proven exactly correct
- Privacy regulations have teeth — GDPR enforcement actions against AI tools have accelerated, CCPA has been strengthened, and sector-specific regulations including HIPAA and FERPA are being actively applied to AI use cases; the white paper’s privacy principles are now regulatory requirements in many jurisdictions, not optional best practices
- Komprise enforces governance automatically — rather than relying on policy documents and employee training, Komprise Smart Data Workflows enforce classification, sensitivity tagging, access controls, and audit trails automatically across petabyte-scale data estates, making governance a continuous automated process rather than a periodic manual review
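To make the lineage and audit-trail principles above concrete, here is a minimal Python sketch of what a single lineage record might capture: what data was used, where it came from, and who authorized its use. The field names and the functions `lineage_entry` and `append_entry` are assumptions made for illustration; they do not describe any Komprise format or a regulatory schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def lineage_entry(path, content, source_system, approved_by, purpose):
    """Build one audit record describing a file released to an AI pipeline."""
    return {
        "path": path,
        # A content hash lets an auditor verify later exactly which bytes were used.
        "sha256": hashlib.sha256(content).hexdigest(),
        "source_system": source_system,
        "approved_by": approved_by,
        "purpose": purpose,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def append_entry(log_lines, entry):
    """Append the entry as one JSON line; earlier lines are never rewritten."""
    log_lines.append(json.dumps(entry, sort_keys=True))
    return log_lines

log = []
entry = lineage_entry(
    "/contracts/2024/msa.pdf",   # hypothetical path
    b"example file bytes",
    "nas-01",                    # hypothetical source system name
    "data.owner@example.com",
    "rag-index",
)
append_entry(log, entry)
```

Writing one append-only record per released file is one simple way to answer, after the fact, which data informed a model and who approved it.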
Which types of unstructured data are most valuable for enterprise AI initiatives and what does effective preparation look like in practice?
The white paper covered the breadth of enterprise unstructured data types relevant to AI. In 2026, the specific use cases have become considerably more concrete and the tooling to address them has matured substantially. The data types that are generating the most active AI investment:
- Medical and clinical imaging — DICOM files, whole-slide pathology images, and radiology studies are the foundation of clinical AI; NewYork-Presbyterian used Komprise to achieve 10x faster AI data ingestion and 96% lower cloud costs for its digital pathology AI program by filtering petabytes of imaging data to exactly the right cohort before ingestion
- Research and genomics data — BAM, FASTQ, and proprietary instrument output files from life sciences and genomics research require custom metadata extraction that standard tools cannot provide; KAPPA Data Services extract domain-specific attributes from these formats at petabyte scale using serverless processing
- Documents, contracts, and legal records — large language models trained on or querying internal document repositories can surface confidential information if sensitive documents are not classified and excluded upstream; organizations need solutions that allow them to find, monitor, secure, and manage unstructured data of all types to ensure AI tools can generate insights while protecting organizations from data leakage, privacy and ethics violations and even lawsuits
- Audio, video, and media assets — call recordings, training videos, and product media are increasingly used in multimodal AI; the same classification and governance framework applies but requires format-specific metadata extraction
- Engineering and sensor data — CAD files, simulation outputs, and IoT sensor logs are the unstructured data of manufacturing AI; the Global Metadatabase indexes these files alongside all other data types, making them discoverable for AI workflows without requiring separate data pipelines per format
- The common requirement across all types — every unstructured data type requires the same upstream preparation: cross-silo discovery via the Global Metadatabase, noise filtering, sensitive data exclusion, metadata enrichment, and governed ingestion; the data type changes, the framework does not
How can enterprise IT teams evolve from managing unstructured data as a storage cost problem to leveraging it as an AI competitive advantage?
The white paper’s central argument was that enterprises sitting on petabytes of unstructured data were sitting on an unrealized AI asset. That argument has been validated by every subsequent market development. A bank that wants to detect fraud beyond what traditional monitoring allows, a healthcare system seeking to accelerate diagnoses, or a manufacturer looking to predict equipment failure all depend on the unstructured data their organizations have been generating for decades. The evolution from cost management to competitive advantage follows a clear sequence:
- Start with visibility — you cannot curate what you cannot see; the Komprise Global Metadatabase provides a unified, continuously updated index of all unstructured data across every storage silo, making the full enterprise data estate visible and queryable for the first time; this is the prerequisite for every subsequent AI initiative
- Cut costs to fund AI — Komprise Flash Stretch tiers cold data off expensive primary storage transparently, reclaiming 70%+ of NAS capacity without disruption; the storage cost savings this generates directly fund the AI infrastructure investment that the 2026 survey shows is now every enterprise’s top IT priority
- Classify and govern proactively — IT organizations need to rethink or update their unstructured data management strategies to find, monitor, secure, and manage unstructured data of all types and across all locations in an efficient and cost-effective manner; that is the only way to ensure that generative AI tools can generate insights while protecting organizations from data leakage, privacy violations and lawsuits
- Automate the AI pipeline — Komprise Intelligent AI Ingest vastly reduces processing costs and time rather than blindly copying large volumes of unstructured data to AI; contextual curation improves the accuracy of AI results by ensuring models train on relevant, high-quality, governed data rather than everything in the estate
- Build for agentic AI — the next evolution beyond RAG pipelines is agentic AI systems that retrieve and act on data autonomously at runtime; the Komprise Global Metadatabase, Smart Data Workflows, and KAPPA Data Services are designed to be invoked by AI agents directly, making the governed enterprise data estate a real-time resource for AI rather than a periodic batch input; organizations that build this foundation now will have a structural advantage as agentic AI matures throughout 2026 and beyond
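As a rough illustration of the visibility and cost steps in the sequence above, the share of capacity held by cold data can be estimated from file metadata alone. The Python sketch below is an assumption-laden example, not a description of Komprise Flash Stretch: the `FileStat` fields, the `cold_fraction` function, and the 365-day threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-file metadata; the field names are illustrative only.
@dataclass
class FileStat:
    path: str
    size_bytes: int
    days_since_access: int

def cold_fraction(files, cold_after_days=365):
    """Return the share of total bytes held by files not accessed within the threshold."""
    total = sum(f.size_bytes for f in files)
    if total == 0:
        return 0.0
    cold = sum(f.size_bytes for f in files
               if f.days_since_access > cold_after_days)
    return cold / total

files = [
    FileStat("/nas/archive/scan_2019.tif", 700, 1200),
    FileStat("/nas/active/report.docx", 300, 3),
]
print(cold_fraction(files))  # prints 0.7
```

The threshold is a policy choice; an organization would tune it to its own access patterns before deciding what to tier.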
