
Interview: Komprise Transparent File Tables

Data storage managers and data engineers are finding an intersection in their roles to solve a core AI problem: making unstructured data usable in lakehouses and AI pipelines. The scale of the problem is daunting. IDC and Gartner estimate that 80-90% of enterprise data is unstructured. It is growing three times faster than structured data, yet most of it remains dark to AI because it lacks the consistent schema that analytics and AI tools require to understand content and context. To address this, Komprise has developed a solution called Komprise Transparent File Tables (TFT).

Komprise COO and Cofounder Krishna Subramanian explains the story behind this offering and how it will cut costs and increase accuracy for AI workflows.

What is Komprise TFT?

Komprise Transparent File Tables (TFT) is a table in Apache Iceberg format that provides a structured, query-ready view of global enterprise unstructured data. Komprise maintains a Global Metadatabase containing system, content and custom metadata, which provides classification and structure for all the data in the enterprise. TFT solves two major issues for data engineers and analysts: it gives unstructured data a consistent schema, and it enables that data to be used in analytics and AI tools such as Databricks and Snowflake without moving petabytes of data.

[Image: Komprise TFT Snowflake screen]

What was the impetus for this new functionality and why was Apache Iceberg chosen for the format?

We all know that AI is nothing without high-quality data, and enterprise data is the differentiator when all enterprises have access to the same models. However, 99% of unstructured data is dark to AI because there is no easy way to query unstructured data. Komprise leverages the industry-standard Iceberg format to allow data analysts and data engineers to directly query unstructured data from tools of their choice without moving the raw data. Komprise always uses open formats such as NFS, SMB/CIFS, and S3/Object, and Iceberg is a natural choice for queries.

What are the existing solutions/methods to bring unstructured data into AI/BI tools?

Current ingestion techniques such as ETL and ELT are built for structured and semi-structured data (e.g., JSON, CSV) because they focus on copying all the raw data without any schema extraction.

For instance, with file-based batch ingestion from cloud storage, files are staged in cloud object storage (S3, Azure Blob, GCS) and loaded into the platform either on a scheduled or incremental basis. Data lands as raw strings, binary blobs, or in semi-structured formats like JSON and XML, and is then refined through processing layers.

You can also keep files in cloud storage and query them in place via external table references from lakehouses such as Databricks and Snowflake. This avoids data duplication and is practical for large unstructured stores where ingestion overhead or storage costs are a concern, though it is limited to cloud-native storage environments. This approach also requires users to do the difficult preprocessing needed to provide a structured description of the unstructured data so that it can be used by data lakehouses. Note that this is the preprocessing Komprise performs for the unstructured data that it manages.

Another common method is real-time streaming ingestion using Kafka or Amazon Kinesis, but this applies only to event and log data rather than traditional unstructured file types like documents, images, video or audio files.

These techniques could be used to ingest raw unstructured data into data lakehouses and then run processing in the lakehouse to extract schema. But there are several critical issues associated with these approaches.

What pains do existing solutions create for data engineers and analysts?

The above approaches are not ideal for enterprises managing large NAS footprints because they copy all the raw data from a source. Unstructured data has no enforced schema for classification and search, which means it is not in a queryable structure and lacks context for analysis and governance. It contains a high proportion of duplicate data, old and non-authoritative data, and data that is dark because it has not been enriched with metadata that can help identify its contents. Furthermore, ingesting petabytes from multi-vendor storage, especially NAS, is complex and can take months to complete, and storing and processing the data in the lakehouse is expensive.

How does Komprise TFT work?

Enterprise IT can choose within the Komprise interface which subset of the Global Metadatabase they want to make available externally. Either the entire Global Metadatabase or specific subsets of it can be made available as Apache Iceberg tables. The IT administrator then loads these Komprise Transparent File Tables into Databricks, Snowflake, or other analytics tools of choice.

With this setup, any data team member, including data analysts, data engineers and AI teams, can simply query this table and operate on it as they normally would with other tables. The results are filtered to just the data the particular user is authorized to see.
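As a rough illustration of what querying such a table might look like, the sketch below uses Python's built-in sqlite3 module as a stand-in for a lakehouse engine such as Snowflake or Databricks. The table name `tft_files`, its columns, and the `allowed_group` column used to mimic per-user filtering are all hypothetical; the actual TFT schema and the way Komprise enforces permissions are not specified in this interview.

```python
import sqlite3

# In-memory SQLite stands in for the lakehouse; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tft_files (
        path TEXT, size_bytes INTEGER, file_type TEXT,
        project_tag TEXT, allowed_group TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO tft_files VALUES (?, ?, ?, ?, ?)",
    [
        ("/nas1/trials/scan_001.dcm", 52_000_000, "dicom", "trial-a", "research"),
        ("/nas1/trials/scan_002.dcm", 48_000_000, "dicom", "trial-a", "research"),
        ("/nas2/hr/payroll.xlsx", 1_200_000, "xlsx", "hr", "hr-admins"),
        ("s3://archive/trial-b/report.pdf", 3_400_000, "pdf", "trial-b", "research"),
    ],
)


def query_files(user_groups, project_tag):
    """Return (path, size_bytes) rows for a project, limited to the user's groups."""
    placeholders = ",".join("?" for _ in user_groups)
    sql = (
        "SELECT path, size_bytes FROM tft_files "
        f"WHERE project_tag = ? AND allowed_group IN ({placeholders}) "
        "ORDER BY path"
    )
    return conn.execute(sql, [project_tag, *user_groups]).fetchall()


# A user in the "research" group asks for the files tagged "trial-a".
rows = query_files(["research"], "trial-a")
for path, size in rows:
    print(path, size)
```

The same user asking for the "hr" project gets no rows back in this sketch, because none of the matching rows belong to a group the user is in. Only metadata rows are read; the files themselves are never touched.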

To summarize:

  • Komprise indexes enterprise data across hybrid cloud storage into a Global Metadatabase.
  • IT users can add rich context to files with content, header and sensitive data scanning and metadata tagging using Komprise AI Preparation and Process Automation and Komprise Smart Data Workflows.
  • Enterprise data experts can then query the Apache Iceberg tables using their preferred BI and analytics tools, without needing knowledge of or access to Komprise and without using APIs.
  • Komprise provides data governance based on user access permissions.
  • If the full files are required for AI, Komprise Intelligent AI Ingest moves them at 2X the speed of standard data transfer tools.

How is Komprise TFT different than existing solutions/methods?

Existing solutions leave the hard work of processing unstructured data to the user. That work is non-trivial, and when it is done poorly the result is low-quality data that erodes AI ROI. Ingesting unfiltered, raw data is time-consuming, expensive and cumbersome. Komprise overcomes these challenges by first adding a rich schema to unstructured data via metadata enrichment so that it can be precisely curated and queried. Komprise leaves the data in place, avoiding large-scale ingestion of raw data, while providing tabular access to it via Komprise Transparent File Tables. Komprise performs the rich metadata extraction, data cleansing and governance to classify and deliver context to data for AI and analytics.
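To make the idea of curating by metadata before ingestion more concrete, here is a minimal Python sketch. It is not Komprise code: the `FileRecord` fields, the `curate` function, the checksum-based duplicate check and the "pii" tag are all assumptions chosen for illustration. It filters a small set of metadata records so that only recent, non-sensitive, unique files would be selected for ingestion.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata row describing one file."""
    path: str
    checksum: str
    size_bytes: int
    last_modified: date
    tags: frozenset


def curate(records, cutoff, blocked_tags):
    """Keep one copy per checksum, dropping stale or sensitive-tagged files."""
    seen = set()
    kept = []
    for rec in sorted(records, key=lambda r: r.path):
        if rec.last_modified < cutoff:
            continue  # outdated
        if rec.tags & blocked_tags:
            continue  # flagged as sensitive
        if rec.checksum in seen:
            continue  # duplicate content already selected
        seen.add(rec.checksum)
        kept.append(rec)
    return kept


records = [
    FileRecord("/nas/a/report.pdf", "c1", 4_000_000, date(2025, 3, 1), frozenset()),
    FileRecord("/nas/b/report_copy.pdf", "c1", 4_000_000, date(2025, 3, 2), frozenset()),
    FileRecord("/nas/a/old_plan.docx", "c2", 1_000_000, date(2015, 6, 1), frozenset()),
    FileRecord("/nas/a/customers.csv", "c3", 9_000_000, date(2025, 1, 5), frozenset({"pii"})),
    FileRecord("/nas/a/spec.docx", "c4", 2_000_000, date(2024, 11, 20), frozenset()),
]

kept = curate(records, cutoff=date(2020, 1, 1), blocked_tags=frozenset({"pii"}))
total = sum(r.size_bytes for r in records)
kept_bytes = sum(r.size_bytes for r in kept)
print([r.path for r in kept])
print(f"bytes to ingest: {kept_bytes} of {total}")
```

In this toy data set, two of the five files survive curation, so only 6 MB of the original 20 MB would need to be moved. The decision is made entirely from metadata, which is the point of curating before ingesting.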

Komprise vs. ETL Tools vs. Built-in Lakehouse Features

Capability: Purpose-built for unstructured data
  • Komprise: Yes. Designed exclusively for file and object data across NAS, cloud, and object storage, not adapted from structured data tools.
  • ETL Tools: No. Primarily designed for structured and relational data pipelines; unstructured support is limited or bolted on.
  • Built-In Lakehouse Features: No. Native connectors and ingestion focus on structured data; unstructured data requires custom work.

Capability: Petabyte-scale support
  • Komprise: Yes. Massively parallel, distributed architecture indexes billions of files and ingests at 2x the speed of standard data transfer tools. (Intelligent AI Ingest)
  • ETL Tools: Partial. Can handle large volumes but requires significant infrastructure investment and tuning for unstructured data at scale.
  • Built-In Lakehouse Features: Partial. Scales well for structured data; petabyte-scale unstructured ingestion requires custom engineering.

Capability: No full data movement required
  • Komprise: Yes. Transparent Move Technology provides in-place access to files; data is dynamically loaded only when needed by AI. (Transparent Move Technology)
  • ETL Tools: No. Copies all source data to the destination by design, regardless of whether it is needed.
  • Built-In Lakehouse Features: No. Data must be loaded into the lakehouse before it can be queried or used.

Capability: Surgical data filtering before ingest
  • Komprise: Yes. Rich filters eliminate irrelevant, outdated, and duplicate files before they reach AI pipelines, improving RAG accuracy and reducing inferencing cost. (Intelligent AI Ingest)
  • ETL Tools: No. Connectors blindly copy data from source to destination without quality filtering.
  • Built-In Lakehouse Features: No. No pre-ingest curation; data quality and relevance must be handled upstream.

Capability: Built-in sensitive data detection
  • Komprise: Yes. Standard and custom PII and sensitive data classification with automated tagging prevents data leakage into AI tools. (KAPPA data services, Intelligent AI Ingest)
  • ETL Tools: No. Requires separate DLP or data catalog tooling; no native sensitive data detection.
  • Built-In Lakehouse Features: No. Governance features vary; sensitive data detection is a separate layer outside the platform.

Capability: Rich metadata enrichment
  • Komprise: Yes. Content, header, and AI-powered metadata extraction and tagging builds a queryable Global Metadatabase. (KAPPA data services, Smart Data Workflows)
  • ETL Tools: Partial. Can pass through existing metadata but does not enrich, classify, or tag unstructured files.
  • Built-In Lakehouse Features: Partial. Catalog features vary by platform; enrichment typically requires custom tagging pipelines.

Capability: Query without APIs or tool access
  • Komprise: Yes. Exposes unstructured data as Apache Iceberg tables directly in Snowflake, Databricks, and other BI tools. No Komprise access or APIs required. (Transparent File Tables)
  • ETL Tools: No. Data engineers must build and maintain query-ready pipelines; no native Iceberg exposure.
  • Built-In Lakehouse Features: Partial. Queryable once loaded, but discovery, structuring, and schema definition are manual.

Capability: Automated governance and audit trail
  • Komprise: Yes. Automatically documents who, what, when, and data lineage for every ingestion workflow for compliance reporting. (Intelligent AI Ingest)
  • ETL Tools: No. Pipeline logging exists but end-to-end data lineage and compliance auditing require additional tooling.
  • Built-In Lakehouse Features: Partial. Some platforms offer audit logging, but coverage for unstructured data lineage is limited.

Capability: Multi-vendor hybrid storage support
  • Komprise: Yes. Indexes NAS, cloud, and object storage across vendors in a single Global Metadatabase with no per-silo engineering required.
  • ETL Tools: Partial. Supports many source connectors but each requires separate configuration and maintenance.
  • Built-In Lakehouse Features: No. Typically limited to cloud-native storage within the platform’s own ecosystem.

What are some use cases for Komprise TFT?

AI-assisted metadata enrichment for annotated training datasets: A machine learning engineer at a healthcare provider can curate a high-quality dataset for fine-tuning a radiology LLM by querying a Komprise Transparent File Table enriched with AI-generated tags (modality, body part, study type, findings) extracted from DICOM files and their associated reports. The engineer can then join this with structured patient cohort data from the EHR to scope the right subset and export it as Parquet for ingestion into the RAG pipeline or fine-tuning workflow. Read about DICOM metadata extraction.

Data estate map across structured and unstructured sources: A data governance lead at a financial services firm can build a unified data estate map by querying a Komprise Transparent File Table for unstructured content across NAS, S3, and object stores, and joining it with structured catalog data from Snowflake or Databricks. This gives them a single view of where sensitive data lives, who owns it, and how it flows across systems, so they can prioritize remediation and compliance efforts across the entire data estate.

AI pipelines combining structured and unstructured data: An AI agent in media and entertainment helping with narrative alignment can use structured project data to identify relevant media archives and join this with Komprise Transparent File Tables to narrow down which scripts to ingest for summarization.

Analytics dashboards combining structured and unstructured data: A data analyst at a pharmaceutical company can create dashboards in Snowflake or Databricks for their drug research projects by querying a Komprise Transparent File Table for project files generated by each instrument and lab. The analyst can then join the data with financial tables from their ERP systems and instrument information from Benchling, thus combining structured and unstructured data from different sources in a single interface.
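The join pattern that runs through these use cases can be sketched in a few lines. The example below is only illustrative: it uses Python's built-in sqlite3 module in place of Snowflake or Databricks, and the tables `tft_files` and `erp_projects`, along with every column and value in them, are hypothetical. It joins file metadata with a structured project table and summarizes file counts and volume per project, the kind of result a dashboard might display.

```python
import sqlite3

# SQLite stands in for the lakehouse; both tables and all columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tft_files (
        path TEXT, project_id TEXT, instrument TEXT, size_bytes INTEGER
    );
    CREATE TABLE erp_projects (
        project_id TEXT PRIMARY KEY, name TEXT, budget_usd INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO tft_files VALUES (?, ?, ?, ?)",
    [
        ("/lab1/seq/run1.fastq", "P1", "sequencer", 500),
        ("/lab1/seq/run2.fastq", "P1", "sequencer", 700),
        ("/lab2/img/plate1.tif", "P2", "microscope", 300),
    ],
)
conn.executemany(
    "INSERT INTO erp_projects VALUES (?, ?, ?)",
    [("P1", "Compound A", 100_000), ("P2", "Compound B", 50_000)],
)

# Join unstructured file metadata with structured project data and aggregate.
summary = conn.execute(
    """
    SELECT p.name, p.budget_usd,
           COUNT(f.path) AS file_count,
           SUM(f.size_bytes) AS total_bytes
    FROM erp_projects AS p
    JOIN tft_files AS f ON f.project_id = p.project_id
    GROUP BY p.project_id, p.name, p.budget_usd
    ORDER BY p.name
    """
).fetchall()

for name, budget, file_count, total_bytes in summary:
    print(name, budget, file_count, total_bytes)
```

Because the file table is just another table to the query engine, the join needs no special handling for unstructured data.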

What are the top benefits for ITOps/storage teams?

For IT teams, vast unstructured data estates stay in place but remain accessible to the data analytics teams building queries. This avoids the costly, complex process of sending petabytes of data across hybrid infrastructure. Komprise scans for sensitive data and tags it for compliance so that it cannot be sent to AI and lakehouses. Filtering data sets for high quality/authority, specific project requirements and sensitive data means that organizations can reduce costs and security risks while ensuring governance and improving business value.

What are the top benefits for data teams?

Data teams and AI get access to high-quality unstructured data through their familiar interface without the cost and complexity of ingesting raw data and figuring out how to extract schema.

How does this announcement expand upon what Komprise has been releasing in the past year?

Our core mission at Komprise is to deliver a platform for enterprise IT to maximize the value and efficiency of unstructured data. Toward this goal, Komprise has delivered several innovations to discover, classify, catalog, enrich, extract and give rich context and structure to unstructured data in the Komprise Global Metadatabase.

  • Komprise recently announced Komprise AI Preparation & Process Automation (KAPPA), which automates the task of creating and applying custom metadata tags that are needed by specific industries to classify and identify the right data for analytics and AI.
  • Combined with Komprise Smart Data Workflows, IT can pre-process data, for example by finding and eliminating duplicate, irrelevant, outdated or non-authoritative data and detecting/confining sensitive data.
  • Then, with KAPPA, analytics teams can procure high quality data specific to their project use case.
  • Automated data workflows can be configured to repeatedly find the right data, exclude the wrong data, tag data sets as needed, send to analytics and AI tools for processing and delete copies after jobs have completed to reduce storage costs.
  • In 2026, Komprise also announced its new Elastic Shares patent which uses dynamic partitioning techniques to overcome limitations of static load balancing approaches in large scale data movement jobs. Elastic Shares optimizes expensive data center resources with more efficient compute utilization.

Now, Komprise TFT builds on these innovations to make this high-quality schema for unstructured data available directly in data lakehouses and AI. Komprise TFT leverages the patented Komprise Transparent Move Technology to provide transparent access to raw data without having to move all the data.

What technical and security requirements are needed for IT to enable data teams?

IT users simply have to export the desired Komprise Transparent File Tables from Komprise and make them accessible in their data lakehouses. Komprise provides role-based access so users can only see the data they are authorized to see. If IT wants to further restrict what data is made available to data teams, they can run Komprise Deep Analytics Queries to create subsets of the Global Metadatabase to make available.

Learn more at Komprise.com/TFT
