This article has been adapted from its original publication on ITProToday.
As AI transforms business operations, organizations need to focus on the data and, specifically, on how to build efficient data pipelines to feed AI. The issue is that traditional Extract, Transform, Load (ETL) pipelines were built for structured data and are fundamentally misaligned with AI's needs.
ETL, designed for structured data from databases, no longer works in a world where 90% of data is unstructured and lives in files of many different formats and types: documents, images, videos, audio files, and instrument and sensor data.
This shift, from the structured-data analytics of the past to today's AI that requires large amounts of unstructured data, demands a complete rethinking of how organizations prepare data for AI consumption.
The Unstructured Data Challenge
The core problem with unstructured data is its inherent lack of a common schema. You can’t take a video file, an audio file, or even three video files from three different applications and place them in a tabular format because they all have different contexts and different semantics.
An MRI medical image and a marketing photograph may share the same file extension, but each requires its own metadata structure and processing approach. Likewise, the same document format might need entirely different preprocessing depending on whether it's being analyzed for legal compliance, customer sentiment, or research insights.
To make unstructured data usable, safe and searchable for AI pipelines, organizations need to accurately enrich metadata in ways that don’t require tedious, Sisyphean manual work. The metadata that storage systems automatically generate is limited: file type, creation date, author, modification date, size, last access date, and user ID.
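For illustration, here is the gap between the metadata a storage system generates on its own and an enriched record that makes the same file useful to an AI pipeline. The field names here are illustrative, not any particular vendor's schema:

```python
# Metadata a typical storage system generates automatically.
auto_metadata = {
    "path": "/nas/research/scan_0481.dcm",
    "file_type": "dcm",
    "size_bytes": 52_428_800,
    "owner": "jdoe",
    "created": "2023-04-12T09:31:00Z",
    "modified": "2024-01-08T14:02:00Z",
    "last_accessed": "2024-06-20T11:45:00Z",
}

# Enriched metadata added by departmental users and automated tools.
enriched_metadata = {
    **auto_metadata,
    "content_class": "mri_brain_scan",  # applied by an AI image classifier
    "project_code": "NEURO-2024-17",    # applied by a user who knows the data
    "contains_pii": True,               # flagged by a PII scanner
    "ai_eligible": False,               # excluded from external AI services
}
```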
- To enrich metadata, you first need a way to create a global file index of your unstructured data, regardless of which storage system or cloud houses the data (a minimal sketch of such an index follows this list).
- Once you have visibility, you can add tags manually, with the help of departmental users who know their data, or automatically, using AI and other automated tools.
- These new technologies — which can be standalone or exist within an unstructured data management platform — rapidly scan data sets and apply relevant tags describing their contents.
- These tools can identify sensitive data, such as personally identifiable information (PII), that must be excluded from AI workflows, and they can add tags, such as project codes or research keywords, that distinctly identify data sets for unique use cases.
- As you catalog unstructured data, it is important to ensure that metadata can follow the data wherever it moves, so it never has to be re-created.
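To make the indexing step concrete, here is a minimal sketch of a global file index built by crawling multiple storage locations. It assumes locally mounted paths and a hypothetical classify() function standing in for whatever AI- or rules-based tagger you use; a real unstructured data management platform would also crawl SMB/NFS shares and cloud buckets through their native APIs.

```python
import hashlib
from pathlib import Path

# Locations to index; in practice these would be NAS mounts, object
# store buckets, etc., each crawled through its own protocol.
LOCATIONS = [Path("/mnt/nas1"), Path("/mnt/nas2")]

def classify(path: Path) -> list[str]:
    """Hypothetical tagger; swap in AI-based or rules-based classification."""
    suffix = path.suffix.lower()
    tags = []
    if suffix in {".dcm", ".nii"}:
        tags.append("medical_imaging")
    if suffix in {".wav", ".mp3"}:
        tags.append("audio")
    return tags

def index_file(path: Path) -> dict:
    stat = path.stat()
    return {
        # Keying the record by a content hash (rather than by path) lets
        # metadata follow the data when files move; a production crawler
        # would hash incrementally instead of reading whole files.
        "content_id": hashlib.sha256(path.read_bytes()).hexdigest(),
        "location": str(path),
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,
        "tags": classify(path),
    }

global_index = [
    index_file(p) for loc in LOCATIONS for p in loc.rglob("*") if p.is_file()
]
```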
Copying and moving unstructured data to locations for AI analysis is also time-consuming and expensive; given the size of the data, it can take weeks or even months. As a result, you want to move only the precise data sets you need, which further underscores the need for metadata enrichment and classification.
Why AI Workflows Break the ETL Model
Beyond format challenges, AI processing itself fundamentally differs from traditional analytics. With AI, the workflows become iterative and non-linear.
For example, let's say you want Amazon Rekognition to look at images and tag them, run PII detection to find and exclude sensitive data, and then send the data to a large language model (LLM) service such as Azure OpenAI for chat augmentation. You now have three different AI processes working on the same data at different points. This creates an AI-feeding-AI scenario where outputs from one process become inputs for another. Traditional ETL wasn't designed for this cyclical enrichment process.
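Here is a rough sketch of that chain, using boto3 for Amazon Rekognition, Amazon Comprehend as a stand-in PII detector (the article does not prescribe a specific one), and the openai SDK for Azure OpenAI. The bucket, key, endpoint, and deployment names are placeholders, and error handling is omitted:

```python
import boto3
from openai import AzureOpenAI

rekognition = boto3.client("rekognition")
comprehend = boto3.client("comprehend")
llm = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",  # placeholder
    api_key="YOUR_KEY",                                 # placeholder
    api_version="2024-02-01",
)

# Stage 1: Rekognition tags the image; its output becomes new metadata.
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photos/img001.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
tags = [label["Name"] for label in labels["Labels"]]

# Stage 2: PII detection screens the file's accompanying text; anything
# that trips it is excluded from the downstream (external) LLM stage.
caption = "Customer onboarding photo, contact jane@example.com"
pii = comprehend.detect_pii_entities(Text=caption, LanguageCode="en")
if pii["Entities"]:
    raise SystemExit("PII found; excluding this item from the LLM stage")

# Stage 3: one AI's output (the tags) becomes another AI's input, the
# cyclical enrichment that a linear ETL pipeline was never built for.
response = llm.chat.completions.create(
    model="gpt-4o",  # placeholder Azure deployment name
    messages=[{
        "role": "user",
        "content": f"Describe an image tagged with: {', '.join(tags)}",
    }],
)
print(response.choices[0].message.content)
```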
Additionally, AI introduces critical data governance challenges that traditional analytics did not face and that ETL does not support, such as preventing the exposure of sensitive data to commercial (external) AI services and maintaining clear audit trails of corporate data. Finally, there is a need to record which metadata was AI-enriched only versus AI-enriched and then human-verified.
Smart Data Workflows for AI
A modern approach to unstructured data preparation for AI requires rethinking the entire data pipeline. Rather than immediately moving data, start by building a comprehensive metadata index that spans all storage environments. This enables intelligent curation that identifies the exact subset of data for AI processing based on content, context, and business requirements. A global metadata index should retain metadata and tags no matter where the data lives, making it independent of any one storage system; one way to achieve that is sketched below.
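Continuing the earlier index sketch, keying records by content hash rather than by path is one way to make metadata storage-independent (an assumption for illustration, not a description of any vendor's mechanism): after a file moves, re-indexing it yields the same content_id, and the existing tags are re-attached instead of re-created.

```python
from pathlib import Path

# Build a lookup over the global_index from the earlier sketch.
index_by_id = {rec["content_id"]: rec for rec in global_index}

def reconcile(moved_file: Path) -> dict:
    """Re-index a file after a move; tags follow the content, not the path."""
    rec = index_file(moved_file)              # from the earlier sketch
    known = index_by_id.get(rec["content_id"])
    if known:
        rec["tags"] = known["tags"]           # metadata follows the data
    index_by_id[rec["content_id"]] = rec
    return rec
```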
This approach delivers significant advantages. In one real-world example, Duquesne University used Komprise and Amazon Rekognition to index and curate data first, identifying 10,000 relevant images out of three million files and cutting processing costs by 97%.
Komprise Smart Data Workflows delivers an automated process for unstructured data preparation and mobility:
- Global metadata indexing and curation: Discover and select relevant data before moving it, integrating with AI processors as needed for rapid content analysis and tagging.
- User tagging: Allow end users to tag their own data since they know it best.
- Iterative enrichment: Store results as reusable metadata to avoid redundant processing (see the sketch after this list).
- Built-in AI data governance: Automatically detect sensitive information and maintain comprehensive audit trails.
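As a sketch of the iterative-enrichment point: checking for an existing result before invoking an AI processor is what turns metadata into a cache, and recording provenance at the same time supports the governance requirement noted earlier. The enrich callable below is a hypothetical wrapper around a call like the Rekognition example above.

```python
from typing import Callable

def ensure_labels(record: dict, enrich: Callable[[str], list[str]]) -> dict:
    """Run an AI processor only if its output isn't already stored."""
    if "image_labels" not in record:
        record["image_labels"] = enrich(record["location"])  # pay once
        # Track provenance: "ai_enriched" vs. "ai_enriched_human_verified".
        record["provenance"] = "ai_enriched"
    return record  # every later workflow reuses the stored result for free
```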
There are several steps to follow on the path toward modern AI data preparation:
- Get full visibility and analytics on unstructured data across storage silos.
- Address data governance from the start.
- Track AI data pipeline effectiveness across diverse use cases.
- Deliver departmental self-service capabilities for unstructured data classification.
As AI becomes central to business strategy, the organizations that implement smart data workflows will gain significant advantages in agility, cost efficiency, and risk management. The question isn’t whether your organization needs a new approach to unstructured data preparation for AI — it’s how quickly you can implement one.

