How to Prep Unstructured Data for Cloud Analytics and AI

This article has been adapted from its original version on TDWI.

More than half of IT leaders report that their organizations manage 5PB or more of data, and most (68 percent) spend more than 30 percent of their IT budget on data storage, backups and disaster recovery, according to the Komprise 2022 State of Unstructured Data Management survey.

[Update: The third-annual industry survey reports that preparing for AI is the leading data storage priority and unstructured data management challenge.]


Five petabytes is a lot of data (about 1.25 billion digital photos’ worth at roughly 4MB each, for example) and much of it is unstructured, meaning that it doesn’t fit neatly into rows and columns in a database. This data, such as log files, IoT sensor data, microscopy data, user documents and medical images, is an untapped gold mine for the nascent field of unstructured data analytics.

With advances in cloud computing, machine learning (ML), and AI tools, unstructured data analytics is now a prime opportunity. Today there is a multitude of cloud-based ML and AI services for different use cases, from image and audio pattern recognition to personally identifiable information (PII) identification.

Some interesting and valuable use cases for unstructured data analytics include medical insurance fraud detection, autonomous vehicle testing, malicious actor detection, precision medicine and customer sentiment analysis of call center audio files.

Top Challenges for Unstructured Data Analytics

The Komprise survey showed that 65 percent of organizations plan to invest, or are already investing, in delivering unstructured data to their new analytics/big data platforms. To succeed in unstructured data analytics, you must clear several hurdles that don’t arise in the relatively straightforward process of mining structured data in databases and spreadsheets. These hurdles map to the 3Vs concept that analyst Doug Laney, later of Gartner, introduced in a 2001 META Group research publication, 3D Data Management: Controlling Data Volume, Velocity, and Variety.

When it comes to unstructured data, these challenges include:

Volume of data. Because there is so much data in organizations today, you can’t feasibly or affordably analyze all of it or copy it all to a cloud service or big data platform. Efficiently finding the right unstructured data across on-premises, edge, and cloud silos and then moving it to an analytics tool is a prominent hurdle today. Adding to this pain is the prevalence of duplicate data: a research group may have several teams working on the same data set, so multiple copies exist across different file shares and geographic locations. We addressed these challenges with the 2022 release of Smart Data Workflows.
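
As a rough illustration of the duplicate-data problem, the sketch below groups files from several shares by content hash so that identical copies surface together. It is a minimal example under stated assumptions, not how any particular product works: the function names `file_digest` and `find_duplicates` are hypothetical, and it assumes the shares are reachable as local directories.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots: list[Path]) -> dict[str, list[Path]]:
    """Group files under the given roots by content; keep groups of 2 or more."""
    # Cheap first pass: files with a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for root in roots:
        for p in root.rglob("*"):
            if p.is_file():
                by_size[p.stat().st_size].append(p)

    # Expensive second pass: hash only the files that share a size.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for p in paths:
            by_digest[file_digest(p)].append(p)

    return {d: sorted(ps) for d, ps in by_digest.items() if len(ps) > 1}
```

Comparing sizes before hashing keeps the scan from reading every byte of every file, which matters at petabyte scale. A real deployment would also need to handle unreadable files and shares that are only reachable over the network.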

Variety of data. Unlike structured data, unstructured data encompasses many different file types across video, audio, logs, lab notebooks, IoT and documents. Understanding which types of files match which data or cloud service is therefore imperative, so that you’re always using the right tool for the right job. For example, looking for PII in documents is entirely different from finding all images that contain dogs. Different analytics techniques are needed to process different types of unstructured data.
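
One simple way to picture matching file types to the right tool is a lookup from file extension to an analysis route. The sketch below is illustrative only: the route names (`pii_scan`, `image_recognition` and so on) and the extension list are assumptions made for this example, not references to real services.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical mapping from file extension to the kind of analysis it needs.
ROUTES = {
    ".pdf": "pii_scan",
    ".docx": "pii_scan",
    ".txt": "pii_scan",
    ".jpg": "image_recognition",
    ".png": "image_recognition",
    ".wav": "audio_transcription",
    ".mp3": "audio_transcription",
    ".log": "log_analytics",
}


def route_for(filename: str) -> str:
    """Pick an analysis route from the file extension (case-insensitive)."""
    return ROUTES.get(PurePath(filename).suffix.lower(), "unclassified")


def group_by_route(filenames):
    """Bucket file names by the route each one should be sent to."""
    groups = defaultdict(list)
    for name in filenames:
        groups[route_for(name)].append(name)
    return dict(groups)
```

For example, `group_by_route(["claims.pdf", "dog.jpg", "call_017.wav"])` places each file in its own bucket. Extensions are a weak signal, so a production pipeline would likely inspect file contents or existing metadata as well.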

Velocity of data. Data is piling up fast and, because of its speed and volume, you often can’t act on it quickly enough to place unstructured data into the appropriate storage technology or data lake for analysis. What comes to mind is the iconic “I Love Lucy” episode where Lucy and Ethel fail at their candy factory job once the conveyor belt speeds up, leaving no time to wrap the chocolates and resulting in plenty of waste. Businesses need automation to manage unstructured data because it is impossible to manually handle the velocity, variety and volume of this data.

Tagging and Automation Help Prep Data for Analytics

Addressing the challenges of unstructured data volume, variety and velocity begins with real-time knowledge of key data characteristics. IT managers also need knowledge of cloud infrastructure and the big data analytics ecosystem across data centers, the edge and clouds.

Tactics may include:

  • Preprocessing data at the edge so it can be analyzed and tagged with new metadata before moving it into a cloud data lake. This can drastically reduce the wasted cost and effort of moving and storing useless data and can minimize the occurrence of data swamps.
  • Applying automation to facilitate data segmentation, cleansing, search and enrichment. You can do this by tagging data, deleting or tiering cold data by policy, and moving data into the optimal storage where it can be ingested by big data and ML tools. The Komprise survey found that the leading new approach to unstructured data management is the ability to initiate and execute data workflows.
  • Adopting an unstructured data management tool that persists metadata tags as data moves from one location to another. For instance, files tagged as containing PII by a third-party ML service should retain those tags indefinitely so that a new research team doesn’t have to run the same analysis over again at high cost. Komprise Intelligent Data Management has these capabilities.
  • Planning appropriately for large-scale data migration efforts with thorough diligence and testing. This can help you avoid the common networking and security issues that derail the timely completion of moving data from one place to another.
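
To make the policy and tagging tactics above more concrete, here is a minimal sketch of an age-based tiering rule in which tags stay attached to a file's record when it moves. It is not Komprise's API. `FileRecord`, `apply_policy`, `move` and the one-year threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical catalog entry: where a file lives plus its metadata tags."""
    path: str
    last_accessed: datetime
    tags: frozenset = field(default_factory=frozenset)


def apply_policy(rec, now, cold_after=timedelta(days=365)):
    """Return (action, record); files untouched for `cold_after` get tiered."""
    if now - rec.last_accessed >= cold_after:
        return "tier", replace(rec, tags=rec.tags | {"cold"})
    return "keep", rec


def move(rec, new_path):
    """Relocate a record; existing tags travel with it unchanged."""
    return replace(rec, path=new_path)
```

A file last accessed two years ago and already tagged `pii` would come back from `apply_policy` with the action `tier` and both tags, and `move` would then change only its path. Keeping tags in a catalog record rather than in the storage system itself is one way to let them survive a migration, though other designs are possible.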

In this economy, speed is a game changer. The faster you can feed quality data into your analytics platform, the sooner you’ll see results and the less time you’ll spend preparing data. Storage management and data management have finally converged, thanks to the demand for unstructured data analytics.
