Unstructured Data

What is Unstructured Data?

Data can be of two broad types: structured data and unstructured data.

  • Structured Data: Structured data is data that can be organized into predefined categories, such as rows and columns in an Excel spreadsheet or a database. For example, accounting records are structured data because you can organize them by customer, by geography, by product, etc. Structured data is typically stored in a database and can be queried using query languages such as Structured Query Language (SQL). Until about 2000, most data was structured, but since then we have seen an explosion of unstructured data. Today, structured data accounts for less than twenty percent of the world’s data.
  • Unstructured Data: Unstructured data is data that doesn’t fit neatly in a traditional database and has no identifiable internal structure. It is the opposite of structured data, which is organized and stored in a database. Up to 80% of business data is considered unstructured, and this share increases year over year. Examples of unstructured data are text documents, e-mail messages, photos, audio and video files, CAD / CAM files, genomics sequencing data, medical images, presentations, IoT and machine-generated data, log files, user documents stored across teams and departments, and much more.


Unstructured data usually has no predefined data model, and it does not map well to relational tables. Text-heavy unstructured data may include numbers and dates, as well as facts, but nothing marks where they are. This makes it difficult for conventional software programs to locate and interpret that information.
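The contrast can be illustrated with a minimal Python sketch. The `invoices` table, the sample e-mail text, and the regular expressions are all hypothetical and chosen only for illustration. With structured data, the schema says where each fact lives, so a SQL query can ask for it directly. With unstructured text, the program has to guess at patterns.

```python
import re
import sqlite3

# Structured: the schema tells us where each fact lives, so SQL can ask directly.
# (Table name and rows are made up for this example.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("Acme", "EMEA", 1200.0), ("Acme", "APAC", 800.0), ("Globex", "EMEA", 500.0)],
)
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM invoices GROUP BY customer")
)

# Unstructured: similar facts are buried in free text with no schema.
# These patterns only catch ISO dates and dollar amounts; any other
# phrasing ("next Friday", "twelve hundred dollars") would be missed.
email = "Hi team, Acme agreed on 2024-03-15 to pay $1,200 for the EMEA rollout."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", email)
amounts = [
    float(m.replace(",", ""))
    for m in re.findall(r"\$([\d,]+(?:\.\d+)?)", email)
]
```

The query works for every row because every row follows the same model. The pattern matching works only for the phrasings it anticipates, which is why unstructured data generally needs purpose-built indexing and analysis tools.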

Unstructured data is the predominant type of data generated by most applications today. From self-driving cars to Internet of Things (IoT) devices, genome sequencers, and video and audio files, most of the data we generate and use is unstructured.

Why is Unstructured Data Growing so Fast?

The analyst firm IDC predicts that we will generate over 175 zettabytes of data by 2025 (one zettabyte is the capacity of one billion 1-terabyte drives). IDC also predicts that in the next three years we will generate more data than we created over the past 30 years, and that this growth trend will continue.

Most of the data we generate today is unstructured because unstructured data has several advantages over structured data:

  • Wider Use Cases for Unstructured Data: Structured data has a rigid, predefined structure and can only be used for its intended purpose. This narrows the number of use cases for structured data: while it is useful for transactional applications like revenue tracking or catalogs, it is not a good fit for applications that generate data that is hard to categorize, such as video or genomics.
  • Various Formats: Unstructured data can be stored in a variety of formats: an MP4 video, a genomics BAM file, a .log diagnostics file, and an X-ray image stored in a PACS (typically as a DICOM file) are all types of unstructured data. An accurate way to describe unstructured data, then, is that it comes in many formats rather than one. This means more applications can generate unstructured data and tailor the format to their use.
  • Various Sizes: Unlike a cell in a database, unstructured data is not bound to a specific size or character limit. For example, you can have small video files for short snippets and large video files for full-length movies. This also increases flexibility in how unstructured data is generated and used.

Since unstructured data is easier to create and use, more applications and users are working with unstructured data.

Unstructured Data Management

Managing the growing volumes of unstructured data generated within an organization is leading to higher expenses.

What to know about unstructured data:

  • Volume: The sheer quantity of data will continue to grow at an incomprehensible rate
  • Velocity: Data is arriving at a continually faster rate
  • Variety: The types of data continue to become more varied

These 3 Vs, originally defined by former META Group / Gartner industry analyst Doug Laney, mean that managing unstructured data growth is critical for organizations whose budgets and resources are stretched to their limits.

Unstructured data management requires an understanding of what data is hot and actively used, and what data is cold and rarely accessed. In most enterprises, over 80% of unstructured data becomes cold within a year of creation, yet it continues to be managed on the most expensive storage and to consume expensive backup resources. Analytics-driven data management can change this by identifying hot and cold data across storage, keeping hot data on the expensive, high-performance environments, and offloading cold data to lower-cost storage that needs less active management. This should be done without restricting access to the cold data, so that users and applications continue to see and access it exactly as before while the organization saves on cold data storage and backups. To understand how Komprise enables enterprise IT organizations to analyze, move, and manage unstructured data and save costs on storage, backup, and cloud infrastructure, read the white paper: Komprise Intelligent Data Management Architecture Overview.
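As a rough illustration of the hot/cold distinction, the Python sketch below partitions files by the age of their last access. It is a simplified example under stated assumptions, not a description of how Komprise or any other product works. The one-year threshold, the `FileInfo` record, and the function names are hypothetical, and last-access times can be unreliable on file systems mounted with access-time updates disabled.

```python
import os
import time
from dataclasses import dataclass

COLD_AFTER_DAYS = 365  # assumed threshold, echoing the "cold within a year" figure


@dataclass
class FileInfo:
    path: str
    size: int     # bytes
    atime: float  # last access time, seconds since the epoch


def scan(root):
    """Walk a directory tree and collect size and last-access time per file."""
    found = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            found.append(FileInfo(path, st.st_size, st.st_atime))
    return found


def split_hot_cold(files, now=None, cold_after_days=COLD_AFTER_DAYS):
    """Partition files into (hot, cold) by the age of their last access."""
    now = time.time() if now is None else now
    cutoff = now - cold_after_days * 86400
    hot = [f for f in files if f.atime >= cutoff]
    cold = [f for f in files if f.atime < cutoff]
    return hot, cold


def cold_fraction(files, **kwargs):
    """Share of total bytes that is cold (0.0 when there is no data)."""
    _hot, cold = split_hot_cold(files, **kwargs)
    total = sum(f.size for f in files)
    return sum(f.size for f in cold) / total if total else 0.0
```

Calling `cold_fraction(scan("/some/share"))` on a directory of your choosing gives a first estimate of how much capacity is a candidate for lower-cost storage. A real assessment would also consider file owners, types, and access patterns over time.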

Unstructured Data Migration Challenges

Migrating unstructured data to the cloud has grown in popularity as a way to save on data storage costs, consolidate data centers, modernize IT infrastructure, and take advantage of cloud-based services such as AI, ML and analytics. But unstructured data migrations to the cloud come with many challenges, including:

  • A global enterprise typically has billions of predominantly small files, each of which carries significant per-file overhead, causing data transfers to be slow.
  • Server message block (SMB) and NFS protocol workloads, which can be user data, electronic design automation (EDA) files, multimedia files or corporate shares, are problematic since these protocols require many back-and-forth handshakes, which increase traffic over the network. The SMB protocol in particular is known to have WAN transfer performance challenges, meaning cloud migrations can take much more time than IT organizations anticipate if not done correctly.
  • File protocols are sensitive to high-latency network connections, which are unavoidable in WAN migrations.
  • Bandwidth is often limited or not always available, causing cloud NAS migration data transfers to become slow, unreliable and difficult to manage.
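A back-of-the-envelope model shows why small files and high latency dominate migration time. The Python sketch below is a simplification with hypothetical parameters: the number of round trips per file is an assumed figure, not a measured property of SMB or NFS, and the model assumes that parallel streams overlap protocol round trips while sharing the same link bandwidth.

```python
def transfer_seconds(num_files, avg_file_bytes, rtt_s, round_trips_per_file,
                     bandwidth_bytes_per_s, parallel_streams=1):
    """Rough estimate of transfer time: protocol chatter plus payload time.

    latency_cost: time spent waiting on per-file handshakes, which
                  parallel streams can overlap.
    payload_cost: time to push the bytes through the link, which all
                  streams share.
    """
    latency_cost = num_files * round_trips_per_file * rtt_s
    payload_cost = num_files * avg_file_bytes / bandwidth_bytes_per_s
    return latency_cost / parallel_streams + payload_cost


# Hypothetical workload: one million 100 KB files over a 1 Gbit/s WAN link
# with 50 ms round-trip time and an assumed 10 round trips per file.
single = transfer_seconds(1_000_000, 100_000, 0.05, 10, 125_000_000)
parallel = transfer_seconds(1_000_000, 100_000, 0.05, 10, 125_000_000,
                            parallel_streams=64)

print(f"single stream: {single / 86400:.1f} days")
print(f"64 streams:    {parallel / 3600:.1f} hours")
```

In this example the payload itself needs only about 800 seconds of link time. Nearly all of the single-stream estimate is handshake latency, which is why tools that parallelize transfers and reduce protocol round trips can shorten WAN migrations so much.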

AI Needs Unstructured Data

In a 2022 blog post, Komprise co-founder and CEO wrote about unstructured data management as the foundation for artificial intelligence (AI) and machine learning (ML) initiatives.

Enterprises need to be ready for this wave of change and it starts by getting unstructured data prepped, as this data is the critical ingredient for AI/ML. This entails new data management strategies which create automated ways to index, segment, curate, tag and move unstructured data continuously to feed AI and ML tools. Unforeseen changes to society, fueled by AI, are coming soon and you don’t want to be caught flat-footed.
