This interview originally appeared on Blocks & Files.
“In the last few years, generative AI’s large language models require vector embeddings to perform semantic search, and such vectors are generated from unstructured data, from the content,” writes Chris Mellor, editor of Blocks & Files. “Are vectors a kind of metadata? We explored these topics with Komprise CEO Kumar Goswami in an interview.”
Mellor: I could argue that the tokens and vector embeddings generated from a data item are metadata. What do you think about this idea?
Kumar Goswami: Metadata and vector embeddings are related but complementary, and you need both. Vector embeddings are a computer-understandable representation of file contents (“the what”), while metadata is valuable information about the file that can go well beyond file contents (“the why”). For example, say you want a chatbot to answer questions about the most recent product features, but only using public-facing documents and not confidential internal documents: use metadata to exclude internal documents and non-final versions, then run the vector embeddings and AI on just the right files.
We are focusing on gathering and globally managing metadata to enrich and narrow down data, while empowering other tools and processes to consume and process the data as a whole. For example, you can enforce AI data governance and improve AI data quality by using Komprise to cull the files fed to Nvidia NeMo for embedding and inference.
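The culling step described above can be sketched in a few lines. This is a hypothetical illustration, not Komprise's API: the `classification` and `status` fields are assumed metadata tags, and the filter stands in for whatever query would select public, final documents before the embedding stage sees them.

```python
# Hypothetical sketch of culling files by metadata before embedding,
# so the AI pipeline only ever sees approved documents.

def cull_for_embedding(files):
    """Keep only public-facing, final documents for the embedding step."""
    return [
        f for f in files
        if f.get("classification") == "public" and f.get("status") == "final"
    ]

files = [
    {"path": "launch-notes.pdf", "classification": "public", "status": "final"},
    {"path": "roadmap-internal.docx", "classification": "confidential", "status": "final"},
    {"path": "launch-notes-draft.pdf", "classification": "public", "status": "draft"},
]

# Only the public, final document survives the cull; everything else
# (internal documents, drafts) never reaches the embedding service.
approved = cull_for_embedding(files)
```

The point of the sketch is the ordering: the metadata filter runs first and cheaply, and the expensive embedding compute runs only on what remains.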
Mellor: Komprise says new tools can automatically analyze file contents and generate semantic tags at scale. What are semantic tags and how do they differ from vector embeddings?
Goswami: Vector embeddings help AI understand the meanings of words in context, while metadata provides semantic context about which files are relevant. For example, vector embeddings may help AI understand that the word “award” in a research grant paper means a funding award, not winning a trophy. Metadata can be used to cull and curate all the documents related to a specific research topic, by a specific researcher, in a specific time frame, and send them to an AI agent that is helping write a grant application.
Mellor: What tools exist that automate finding and analyzing metadata?
Goswami: You need to index metadata across different storage and cloud environments and also act on it at scale. Komprise does both: our analysis extracts both system metadata and extended metadata, such as sensitive data information, into a global file index. This index retains the knowledge no matter where your data lives, and it does so without changing the original files. Komprise Deep Analytics helps you query and filter data based on this index, and Komprise Smart Data Workflows allows you to search for and feed the right data to the right AI process and retain its outputs as additional metadata.
Unlike with traditional ETL, you need an ongoing workflow solution to find the right data, get it to the right compute, run the compute either locally or in the cloud, and then repeat the process. You can use any AI model, vector embedding service, or other processor in Komprise workflows to further enrich the metadata on your data. A great example of this is our customer Duquesne University.
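The "find, compute, enrich, repeat" loop described above can be sketched as a small incremental tagging workflow. This is a hypothetical illustration (not a Komprise product interface): `tagger` stands in for any AI or content-analysis service called via API, and the `version` field is an assumed change marker used to skip unchanged files on later passes.

```python
# Hypothetical sketch of an ongoing enrichment workflow: tag files,
# persist results as metadata, and skip unchanged files on the next pass.

def run_workflow(files, tagger, index):
    """Tag each file once per content version.

    `index` maps path -> (version, tags) and plays the role of a
    persistent metadata index that survives between runs.
    """
    for f in files:
        prev = index.get(f["path"])
        if prev and prev[0] == f["version"]:
            continue  # unchanged since the last pass: skip expensive compute
        index[f["path"]] = (f["version"], tagger(f))
    return index

def keyword_tagger(f):
    # Stand-in for any external AI or keyword-analysis process.
    return ["grant"] if "grant" in f["text"] else []

index = {}
files = [
    {"path": "a.txt", "version": 1, "text": "grant proposal"},
    {"path": "b.txt", "version": 1, "text": "meeting notes"},
]
run_workflow(files, keyword_tagger, index)
run_workflow(files, keyword_tagger, index)  # second pass does no re-tagging
```

The design point is that the workflow is iterative rather than one-shot: the index persists the enrichment, so repeated runs only pay compute for new or changed data.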

Mellor: What AI tools are now available to extract pertinent information hidden in files and turn it into useful metadata that adds structure and context? How is the synthesis carried out?
Goswami: Anything that looks at file contents and generates outputs can be used via APIs in Komprise to enrich metadata. You can use cloud-based services like Azure AI Speech to inspect audio or Salesforce Einstein to find particular purchase orders in your CRM, and then have Komprise tag the files. That is the beauty of iterative workflows. You can use any process or tool to distill relevant metadata once you have a systematic way to manage the workflow.
Mellor: I understand Komprise thinks that automatic metadata from storage systems, while useful for basic operations, is just the start of a strategic metadata management program. Can you explain?
Goswami: There are many types of additional metadata, some of which are shown below. You could have users manually apply additional tags based on their knowledge, or systematically automate tagging at scale based on the artifacts of other processes, as explained in the prior answers. Enriched metadata becomes part of the data stored and indexed by an unstructured data management system. To be effective, such systems must handle billions of metadata tags and persist those tags wherever the data lives and moves. Komprise can do this today.
- Contextual metadata: Project identifiers, geographical tags, departmental associations, and business context that give meaning beyond technical properties. Some of this information can be extracted from applications, some from headers in files, and some via APIs from related applications (like getting the account identifier for a proposal from the CRM system).
- Sensitivity metadata: PII, intellectual property, regulated data types and security classifications. This requires specialized tools to uncover and classify, as it involves analyzing file contents rather than just properties.
- User-based metadata: Manual tags, collaborative annotations and crowd-sourced insights that add human intelligence to data classification. While powerful, this approach faces scalability challenges as data volumes explode.
- AI-generated metadata: The newest and most transformative category. AI analyzes file contents and automatically generates contextual tags and classification insights at scale.
Mellor: How can Komprise automatically identify and classify data based on business value, access patterns, and project requirements?
Goswami: Komprise offers automatic identification of sensitive data in the product today, whether that is PII or a keyword/regex search for a custom query. We can also work with any third-party AI tool to scan for different data types, uniquely identifying data contents with the tags that departmental users and data scientists need for projects. Culling and feeding the right data to AI is very important, whether the AI runs locally or in the cloud, for three key reasons: a) it can be very costly to copy a lot of unnecessary data across environments; b) you don’t want to run expensive AI compute on irrelevant data, or repeatedly on unchanged data; and c) most importantly, feeding the wrong data to AI could cause data leakage and inaccurate results.
Mellor: How can Komprise help data scientists understand data lineage and ensure compliance with governance requirements?
Goswami: As Komprise moves data to AI, it maintains an audit of what information was sent, and it tracks the lineage of where the data came from and where it has been moved. Increasingly, data governance is not just about complying with regulations but a corporate priority to prevent leakage of corporate information. Komprise offers sensitive data detection and mitigation, orphaned and duplicate data search and deletion, and the ability to automate data management policies for different use cases. For instance, tiering cold data to immutable storage for ransomware protection, or ensuring that data subject to regulations such as HIPAA and GDPR is stored and protected appropriately, are great strategies to augment what cybersecurity teams are doing.
You can set up a Deep Analytics query to identify these protected data sets (PII, PHI) and automatically act on them if they are not handled properly by confining them, sending them to compliant storage and deleting them per regulatory requirement timelines.

Mellor: Komprise says sensitive data detection through metadata tagging for “PII” and other keywords helps find protected data that may be stored in non-compliant locations and secure it properly against cyberattacks. Can Komprise automate this process?
Goswami: Yes! You can select the file shares and directories to search, and Komprise will scan them for PII such as names, birth dates, user IDs, driver’s license numbers, Social Security numbers, credit card numbers, and addresses. You can also use regex/keyword search to find IP data or other data your organization deems sensitive that doesn’t fit any standard definition, such as an EmployeeID or PatientID. You can then use a Smart Data Workflow to take additional actions, such as confining the data sets for manual review for legal hold or deletion, and/or automatically moving them to secure storage.
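The regex/keyword scanning described above can be sketched in a few lines. This is a rough illustration, not Komprise's detection engine: the patterns are deliberately simplified (real PII detection is far more involved), and the `EmployeeID` pattern stands in for an organization-specific custom definition.

```python
import re

# Simplified, illustrative patterns for sensitive-data scanning.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    # Custom, organization-specific identifier via keyword/regex:
    "employee_id": re.compile(r"\bEmployeeID:\s*\d+\b"),
}

def scan(text):
    """Return the set of sensitive-data tags found in a file's text."""
    return {tag for tag, pattern in PATTERNS.items() if pattern.search(text)}

sample = "Contact: Jane Doe, SSN 123-45-6789, EmployeeID: 4481"
tags = scan(sample)
```

In a real workflow, the tags returned by a scan like this would be persisted as metadata on each file, so a downstream query or policy (confine, move to secure storage, delete) can act on every file carrying a given tag.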

