Data Management Glossary

Data Tagging

What is data tagging?

Data tagging is the process of adding metadata to your file data in the form of key-value pairs. These tags give context to your data so that others can easily find it in search and act on it, for example by moving it to confinement or to a cloud-based data lake. Data tagging is valuable for research queries and analytics projects, and for complying with regulations and policies.
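As a rough illustration of tags as key-value pairs, the sketch below attaches a small dictionary of tags to each file record and searches on one tag. The `FileRecord` class, the `find_by_tag` helper, and the sample paths are hypothetical names made up for this example; they are not part of any Komprise API.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """A file path plus its tags (illustrative structure only)."""
    path: str
    tags: dict = field(default_factory=dict)  # tag key -> tag value


def find_by_tag(records, key, value):
    """Return the records whose tag `key` equals `value`."""
    return [r for r in records if r.tags.get(key) == value]


# Hypothetical sample data.
records = [
    FileRecord("/lab/scan_001.tif", {"project": "alpha", "sensitivity": "low"}),
    FileRecord("/lab/scan_002.tif", {"project": "beta"}),
    FileRecord("/hr/payroll.xlsx", {"sensitivity": "high"}),
]

alpha_files = find_by_tag(records, "project", "alpha")
print([r.path for r in alpha_files])
```

Because every tag is a plain key and value, the same lookup works for any business criterion (project, sensitivity, content type) without changing the files themselves.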

How does Komprise data tagging work?

Users, such as data owners, can apply tags to groups of files, and analytics applications can apply tags programmatically via API. In the Komprise Deep Analytics interface, users can query the Global Metadatabase to find the data to tag. Tagging is then done by creating a Komprise Plan or Smart Data Workflow that invokes the text search function to inspect and tag the selected files. The ability to use Komprise Intelligent Data Management to search, find, apply tags, and then take action helps customers get value from enriched data sets faster.

Tagging and Smart Data Workflows

Komprise Smart Data Workflows automate unstructured data discovery, data mobility and the delivery of data services.

Define a custom query to find a specific data set
Analyze and tag the data set with additional metadata
Move only the tagged data for analytics, AI/ML, etc.
Move the data to a lower-cost storage tier after analysis
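The four steps above can be sketched as a single function over an in-memory catalog. This is a simplified illustration under stated assumptions, not how Komprise implements workflows: the `catalog` structure, the `run_workflow` function, and the tier names are all hypothetical.

```python
# Hypothetical in-memory catalog standing in for a real file index.
catalog = [
    {"path": "/nas1/img/a.dcm", "ext": ".dcm", "tier": "primary", "tags": {}},
    {"path": "/nas1/img/b.dcm", "ext": ".dcm", "tier": "primary", "tags": {}},
    {"path": "/nas1/doc/c.txt", "ext": ".txt", "tier": "primary", "tags": {}},
]


def run_workflow(files, ext, tag_key, tag_value, cold_tier="archive"):
    # Step 1: custom query to find a specific data set.
    selected = [f for f in files if f["ext"] == ext]

    # Step 2: tag the selected files with additional metadata.
    for f in selected:
        f["tags"][tag_key] = tag_value

    # Step 3: hand only the tagged files to analytics (here, just list them).
    to_analyze = [f["path"] for f in files if f["tags"].get(tag_key) == tag_value]

    # Step 4: after analysis, mark the same files for a lower-cost tier.
    for f in selected:
        f["tier"] = cold_tier

    return to_analyze


analyzed = run_workflow(catalog, ".dcm", "study", "mri-2024")
print(analyzed)
print([f["tier"] for f in catalog])
```

Only the files matched by the query are tagged, analyzed, and tiered; the unrelated text file keeps its original tier and has no tags.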

What is data tagging and why does it matter for unstructured data management?

Data tagging is the process of assigning descriptive labels or metadata attributes to files, objects, and other data resources to make them easier to organize, search, classify, and act on at scale. For unstructured data, which lacks inherent schema or structure, tagging is the primary mechanism for adding context and meaning to files that would otherwise be invisible to search, analytics, and AI systems. Key reasons data tagging matters:

Searchability — tags make it possible to find specific files across petabyte-scale data estates using business-relevant criteria such as project name, patient ID, sensitivity status, or content type
AI readiness — AI models require precise, well-labeled datasets; tagging enables IT teams to identify and deliver exactly the right subset of files for each AI use case without moving entire datasets
Governance and compliance — sensitivity tags, legal hold labels, and retention classifications enable automated policy enforcement across all storage tiers
Workflow automation — Komprise Smart Data Workflows automate unstructured data discovery, data mobility, and the delivery of data services, using tags to define custom queries, analyze data sets, and move only the tagged data for analytics, AI and ML use cases
Persistent value — tags applied to files should persist as data moves from one location to another, so that a new research team does not have to run the same analysis over again at high cost

What is the difference between manual and automated data tagging, and which is better for petabyte-scale unstructured data?

Manual tagging relies on users assigning labels based on their knowledge of file content. Automated tagging uses algorithms, machine learning, NLP, or custom code to extract and assign tags programmatically. At petabyte scale, manual tagging is not viable. Automated tagging is the only practical approach for enterprise unstructured data estates. Key distinctions:

Manual tagging works for small, well-understood datasets where human judgment adds value, such as curating a specific research collection or applying legal hold labels to a defined set of files
Automated tagging scales to billions of files across hybrid storage silos, applying consistent labels based on file content, header metadata, access patterns, sensitivity status, or custom business logic without human intervention
KAPPA data services — KAPPA Data Services extend automated tagging to proprietary and industry-specific file formats at petabyte scale using serverless processing; a few lines of Python can extract DICOM header attributes, genomics BAM file metadata, or ERP project codes and write them as searchable tags to the Global Metadatabase
API-driven flexibility — Komprise data tagging is flexible and customizable, using APIs to apply tags wherever it is most convenient, including at the edge or data center before data moves to the cloud
Third-party AI integration — Komprise Smart Data Workflows connect to services such as AWS Rekognition to analyze file contents and apply AI-generated tags, which Komprise then persists in the Global Metadatabase no matter where the data moves
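To make the idea of automated, header-based tagging concrete, here is a minimal rule-based sketch. It assumes the file header has already been parsed into a dictionary; reading a real DICOM file would need a parser library, which is not shown. `HEADER_TO_TAG` and `tags_from_header` are hypothetical names, and this is not the KAPPA interface.

```python
# Mapping from header attribute names to tag keys (an assumption for this
# example). Patient-identifying attributes are deliberately left out.
HEADER_TO_TAG = {
    "Modality": "modality",
    "BodyPartExamined": "body_part",
    "StudyDescription": "study_type",
}


def tags_from_header(header):
    """Derive searchable tags from an already-parsed header dictionary."""
    tags = {}
    for attr, tag_key in HEADER_TO_TAG.items():
        value = header.get(attr)
        if value:  # skip attributes that are missing or empty
            tags[tag_key] = str(value).strip().lower()
    return tags


# Hypothetical parsed header.
header = {"Modality": "MR", "BodyPartExamined": "BRAIN ", "PatientName": "DOE^JANE"}
print(tags_from_header(header))
```

The same function applied in a loop gives consistent labels across any number of files, which is the property that manual tagging cannot offer at scale.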

How does the Komprise Global Metadatabase store, manage, and extend data tags across petabyte-scale unstructured data estates?

The Komprise Global Metadatabase is a continuously updated, unified index of standard and enriched metadata across all NAS, cloud, and object storage environments, and it is the foundation on which all Komprise data tagging capabilities are built. How it works:

Automatic indexing — when Komprise points at different file and object repositories, it automatically indexes all standard metadata and creates a Global File Index, capturing file names, types, sizes, owners, creation and access dates, and directory structures across every silo
Enriched metadata storage — all tags applied through Komprise Deep Analytics queries, KAPPA functions, third-party AI services, or direct API calls are written back to the Global Metadatabase and persist regardless of where the underlying file moves
Tag-based search and curation — a metadatabase can manage all data tags at scale and provide a simple, rapid way for users to search data based on tags and take actions accordingly, enabling IT teams to identify precise datasets for AI, compliance, or cost optimization in seconds
Sensitivity and governance tags — by applying sensitivity tags, IT users can keep sensitive data segmented from AI data workflows and stored in compliant locations; for AI, tags can include keywords describing file contents, such as medical diagnosis or seismic data, so that precise data sets can be selected for model training or inferencing
Durable across data movement — tags stored in the Global Metadatabase remain intact when files are tiered, migrated, or copied, ensuring enrichment work is never lost and does not need to be repeated
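One way to picture tags that survive data movement is an index that keys tags on a stable file identifier rather than on the file's path. The `TagIndex` class below is a toy model written for this explanation; it is an assumption about the general technique, not a description of how the Global Metadatabase is built.

```python
class TagIndex:
    """Toy metadata index: tags hang off a stable file ID, not a path."""

    def __init__(self):
        self._location = {}  # file_id -> current location
        self._tags = {}      # file_id -> {tag key: tag value}

    def add_file(self, file_id, location):
        self._location[file_id] = location
        self._tags.setdefault(file_id, {})

    def tag(self, file_id, key, value):
        self._tags[file_id][key] = value

    def move(self, file_id, new_location):
        # Only the location changes; the tag record is untouched.
        self._location[file_id] = new_location

    def search(self, key, value):
        """Return (file_id, current location) for every matching file."""
        return [
            (fid, self._location[fid])
            for fid, tags in self._tags.items()
            if tags.get(key) == value
        ]


index = TagIndex()
index.add_file("f1", "nas://site-a/scans/001.tif")  # hypothetical locations
index.tag("f1", "diagnosis", "fracture")
index.move("f1", "s3://archive-bucket/scans/001.tif")

hits = index.search("diagnosis", "fracture")
print(hits)
```

After the move, the search still finds the file by its tag and reports the new location, so the enrichment work does not have to be repeated.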

How does data tagging improve AI accuracy and reduce the cost of AI data pipelines for unstructured data?

Untagged, ungoverned unstructured data fed into AI pipelines degrades model accuracy, wastes GPU compute on irrelevant files, and creates compliance risk from sensitive data leakage. Data tagging solves all three by creating a queryable, governed layer of context above the raw file estate. The Komprise approach:

Filter before you move — metadata enrichment and classification using intelligent tagging, NLP, and PII detection allows data teams to classify data by content, owner, usage, and risk before any data reaches an AI pipeline, eliminating data noise at the source
Reduce dataset size dramatically — a healthcare organization can tag petabytes of medical imaging by diagnosis code, patient demographics, and study type, then query those tags to reduce a dataset of millions of files to tens of thousands before AI ingestion, cutting cloud compute and egress costs by 96% or more
KAPPA for domain-specific tagging — KAPPA data services use serverless processing to extract and apply custom tags from proprietary clinical, scientific, and enterprise file formats that standard tools cannot read, enriching the Global Metadatabase with domain-specific context that AI models need to produce accurate results
Sensitive data exclusion — metadata tagging and enrichment allows data owners to add context and structure to unstructured data so that it can be easily discovered and segmented, with sensitivity tags triggering automatic exclusion of PII, PHI, and IP from AI workflows before ingestion
Tags persist through the AI lifecycle — Komprise applies all tags back to the Global Metadatabase even when a cached copy is used for AI, ensuring the original data is continuously enriched and governance records remain current
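The "filter before you move" and "sensitive data exclusion" points can be sketched as a tag query that runs before any file is sent to an AI pipeline. The function names, tag keys, and sample files are hypothetical. The file counts in the arithmetic at the end are illustrative figures chosen only to show what a 96% reduction looks like; they are not measured results.

```python
# Sensitivity values that should never reach an AI workflow (assumed labels).
EXCLUDED = {"pii", "phi", "ip"}


def select_for_ai(files, required_tags):
    """Keep files that match every required tag and carry no excluded label."""
    selected = []
    for f in files:
        tags = f["tags"]
        if tags.get("sensitivity") in EXCLUDED:
            continue  # sensitive data is dropped before ingestion
        if all(tags.get(k) == v for k, v in required_tags.items()):
            selected.append(f["path"])
    return selected


def reduction_percent(total, selected):
    """Percentage of files removed by filtering."""
    return 100.0 * (total - selected) / total


# Hypothetical tagged files.
files = [
    {"path": "/img/1.dcm", "tags": {"study_type": "mri", "diagnosis": "m17"}},
    {"path": "/img/2.dcm", "tags": {"study_type": "mri", "diagnosis": "m17",
                                    "sensitivity": "phi"}},
    {"path": "/img/3.dcm", "tags": {"study_type": "ct", "diagnosis": "m17"}},
]

chosen = select_for_ai(files, {"study_type": "mri", "diagnosis": "m17"})
print(chosen)

# Illustrative arithmetic: narrowing 2,000,000 files to 80,000 removes 96%.
print(reduction_percent(2_000_000, 80_000))
```

If compute and egress charges scale roughly with the number of files processed, the cost reduction tracks the file-count reduction, which is why the filtering step comes first.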

Why should enterprises tag unstructured data at the edge or on-premises rather than waiting until data reaches the cloud?

Tagging data at the edge, before it moves to the cloud, delivers faster AI results, lower egress costs, and better data quality at the destination. It is also the only practical approach for datasets where full-file movement is cost-prohibitive. Key reasons to tag early and locally:

Network efficiency — network restrictions including limited bandwidth, long latency, and costs can make sending massive datasets over the internet problematic; searching and narrowing just the right data before sending it to the cloud can speed up data analytics significantly
Application-specific metadata — custom or industry-specific applications such as Lab Information Management Systems will have their own set of metadata for files, for example a device ID in microscopy image files, which will only be available by querying the applications at the local data center
Zero-move tagging — Komprise indexes and tags files in place across any storage silo without moving them, so enrichment work is complete before any data movement decision is made
Komprise Observer architecture — Komprise virtual appliances deployed adjacent to the data analyze and tag files using custom policies, operating on petabytes of data across distributed data centers and cloud environments without impacting storage performance or application access
Continuous and automated — once queries and policies are designed in Komprise, the platform automatically and continuously executes data pipelines, so as new files are created they are found, tagged, and acted on without manual intervention
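As a small illustration of tagging in place and picking up new files on each pass, the sketch below walks a directory, derives tags from file metadata without copying anything, and remembers which paths it has already handled. `scan_new_files` is a hypothetical helper, and the temporary directory stands in for a local storage share; a production system would persist its state and schedule the scans.

```python
import os
import tempfile


def scan_new_files(root, seen):
    """Walk `root` in place and return tags for files not seen before."""
    new_tags = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if path in seen:
                continue  # already tagged on an earlier pass
            seen.add(path)
            info = os.stat(path)  # metadata only; file contents are not read
            new_tags[path] = {
                "extension": os.path.splitext(name)[1].lower() or "none",
                "size_bytes": str(info.st_size),
            }
    return new_tags


with tempfile.TemporaryDirectory() as root:
    seen = set()

    # First pass: one file exists.
    open(os.path.join(root, "a.tif"), "wb").close()
    first = scan_new_files(root, seen)

    # A new file arrives; the second pass tags only that file.
    with open(os.path.join(root, "b.csv"), "w") as fh:
        fh.write("x,y\n")
    second = scan_new_files(root, seen)

    first_count = len(first)
    second_names = sorted(os.path.basename(p) for p in second)

print(first_count, second_names)
```

Each pass touches only files that appeared since the previous one, so the tagging keeps up with new data without rescanning what is already indexed.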
