Data Is Piling Up in the Cloud, in Data Centers and at the Edge
How do you easily search across these various data silos to find the data you want to analyze further and feed AI/ML applications?
Searching and prepping unstructured data is hard: data scientists spend an estimated 80% of their time finding, cleansing and organizing data rather than analyzing it.
Unstructured data (such as audio, video, images, genomics data and IoT data) is typically stored as files or objects, both in file storage and in the cloud. It has no common structure and can easily amount to billions of files and objects strewn across many buckets, accounts and file stores.
Customers need a way to search across all storage and then mobilize and use the data through systematic data management. Komprise, an independent data management platform, delivers a Global File Index across all unstructured data along with a scale-out architecture, which means that customers can quickly find and act on specific data sets and set up automated policies. Think of it as a Google-like search across all your data repositories, along with the ability to automatically move or execute actions on the results. To improve search further, customers need data tagging capabilities, which enrich and refine data by adding metadata.
Why Unstructured Data Tagging Matters
Metadata makes it easier to find and manage data and take action. This is where data tagging comes into play, and it’s a core feature in Komprise Intelligent Data Management. Tagging adds metadata to your file data in the form of key-value pairs. These values give context to your data, allowing it to be easily found or associated with a project, study or classification.
Examples of tags: Country = US, Project ID = 123, HIPAA = TRUE
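To make the key-value idea concrete, here is a minimal Python sketch of tagged file records and a tag-based search. It is only an illustration: `FileRecord`, `find_by_tags` and the sample paths are hypothetical names invented for this example and are not part of any Komprise API.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """One indexed file or object plus its tags (hypothetical model)."""
    path: str
    tags: dict = field(default_factory=dict)


def find_by_tags(records, wanted):
    """Return the records whose tags contain every requested key-value pair."""
    return [
        record for record in records
        if all(record.tags.get(key) == value for key, value in wanted.items())
    ]


# Sample data, made up for illustration only.
records = [
    FileRecord("/lab/scan_001.dcm", {"Country": "US", "Project ID": "123", "HIPAA": "TRUE"}),
    FileRecord("/lab/scan_002.dcm", {"Country": "DE", "Project ID": "123", "HIPAA": "FALSE"}),
    FileRecord("/finance/q3_report.xlsx", {"Country": "US", "Project ID": "456"}),
]

# Find every HIPAA-flagged file that belongs to project 123.
matches = find_by_tags(records, {"Project ID": "123", "HIPAA": "TRUE"})
print([record.path for record in matches])
```

The point of the sketch is that a tag query is just a match on key-value pairs, independent of where each file physically lives.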
Tagging helps you become agile with your data: you can quickly find the exact files you want in a sea of potentially hundreds of billions of files and then send those data sets to analytics tools and data lakes on-premises, at the edge or in the cloud.
These tags can be applied either by data stewards or owners, who may have intimate knowledge of the data and its business value, or programmatically by analytics applications via an API. This is valuable for research queries and analytics projects, or for complying with regulations and policies.
Let’s look at a few examples of how tags can be leveraged with Intelligent Data Management:
Mergers and Acquisitions: Recently two regional banks entered into a merger agreement. Part of this process involved moving massive amounts of data to different data centers and clouds. By tagging data sets with values that indicate the bank of origin and categories, the newly formed company can efficiently process and manage the data over its lifecycle.
Edge-to-Cloud: Lab instruments often generate terabytes of data, which are stored on a NAS file system. This file system can serve as a daily cache: as new data lands each day, it can be tagged and automatically tiered to the cloud. The benefit of this approach is that lab data is available in the cloud, tagged and, thanks to Komprise Transparent Move Technology™ (TMT), natively accessible as objects. Users can then import it for analysis with any cloud data analytics service, at a dramatically lower storage cost.
Improving Customer Support: A technology company used a machine learning program to run sentiment analysis on call center recordings. The results, such as customer satisfaction scores, were recorded on each audio file as a tag. Now employees can find relevant audio recordings for training and improve support efficiency.
Medical Imaging: A healthcare system may want to run machine learning on medical images and then tag each image with diagnosis codes. Researchers can then quickly find images by diagnosis to support clinical projects.
Legal Hold: Legal discovery applications can find documents and file data related to litigation and then apply tags with the case ID. This data set can then be copied to immutable storage with retention policies to ensure the evidence is not altered or deleted.
Automotive: Data collected from self-driving cars can be tagged with information about driving conditions. Once tagged, these data sets are useful for replaying scenarios or generating synthetic data for further training. See a demo.
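Several of these examples follow the same pattern: an analysis step produces a value, the value is written back as a tag, and people later search on that tag. The Python sketch below illustrates the pattern for the call center case. Everything in it is an assumption made for illustration: `tag_with_analysis`, `select`, the in-memory `index` and the made-up scores stand in for a real sentiment model and a real tagging API, neither of which is shown here.

```python
def tag_with_analysis(index, paths, analyze, tag_key):
    """Run analyze(path) on each path and store the result as a string tag."""
    for path in paths:
        index.setdefault(path, {})[tag_key] = str(analyze(path))


def select(index, tag_key, predicate):
    """Return the sorted paths whose tag value satisfies the predicate."""
    return sorted(
        path for path, tags in index.items()
        if tag_key in tags and predicate(tags[tag_key])
    )


# Stand-in for a real sentiment model: invented scores per recording.
fake_scores = {"calls/a.wav": 0.91, "calls/b.wav": 0.35, "calls/c.wav": 0.78}

# Hypothetical in-memory tag store: path -> {tag key: tag value}.
index = {}
tag_with_analysis(index, fake_scores, fake_scores.get, "satisfaction")

# Pick the high-satisfaction calls to use as training examples.
training_set = select(index, "satisfaction", lambda value: float(value) >= 0.75)
print(training_set)
```

The same two steps would apply to the legal hold case, with the analysis step replaced by a discovery search and the tag value replaced by a case ID.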
You may have some tags generated in industry-specific applications such as Electronic Lab Notebooks (ELN) or Lab Information Management Systems (LIMS), but these tags do not propagate to the cloud and cannot be used outside of the specific application. Komprise provides a way to create, use and search on standard metadata and tags no matter where your data lives, and to maintain that information as data moves from one repository to another. You can also execute additional functions on data and enrich it with tags, which likewise persist as data moves between systems.
In our next blog we’ll dive into how you can process and tag your data where it lives without moving it to the cloud. We’ll also include a use case where we copy a data set to the cloud for analysis. In both cases the data will be processed and tagged with policy-driven automation. (VentureBeat article: How to create data management policies for unstructured data)