Data Management Glossary
What is Unstructured Data?
Data can be of two broad types: structured data and unstructured data.
- Structured Data: Structured data is data that can be organized into predefined categories, such as the rows and columns of an Excel spreadsheet or a database. For example, accounting records are structured data because you can organize them by customer, geography, product, and so on. Structured data is typically stored in a database and can be queried with languages such as Structured Query Language (SQL). Until about 2000, most data was structured, but since then there has been an explosion of unstructured data. Today, structured data accounts for less than 20% of the world’s data.
- Unstructured Data: Unstructured data is data that doesn’t fit neatly in a traditional database and has no identifiable internal structure. This is the opposite of structured data, which is data stored in a database. Up to 80% of business data is considered unstructured, with this number increasing year over year. Examples of unstructured data are text documents, e-mail messages, photos, videos, presentations, social media posts, and more.
Unstructured data usually has no predefined data model and does not map well to relational tables. Text-heavy unstructured data may still contain numbers, dates, and facts, but because they are not stored in labeled fields, conventional software programs have difficulty identifying them.
Unstructured data is the predominant type of data generated by most applications today. From self-driving cars to Internet of Things (IoT) devices, genome sequencers, and video and audio files, most of the data we generate and use is unstructured.
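To make the contrast concrete, here is a minimal Python sketch. It assumes a hypothetical `invoices` table and a made-up e-mail body; neither comes from a real system. The structured records can be aggregated with a single SQL query, while the same kind of facts in free text have to be pulled out with pattern matching that only works for the formats it anticipates.

```python
import re
import sqlite3

# Structured: a predefined schema (hypothetical "invoices" table) queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("Acme", "EMEA", 1200.0), ("Acme", "APAC", 300.0), ("Globex", "EMEA", 450.0)],
)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM invoices GROUP BY customer ORDER BY customer"
).fetchall()

# Unstructured: similar facts buried in free text (made-up example), with no schema.
email_body = (
    "Hi team, Acme confirmed the order on 2023-04-12. "
    "They expect an invoice for $1,200.00 before 2023-05-01."
)
# These patterns are illustrative assumptions; other date or currency
# formats in the text would simply be missed.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", email_body)
amounts = re.findall(r"\$[\d,]+\.\d{2}", email_body)

print(totals)
print(dates, amounts)
```

The SQL query relies on column names that are known in advance. The regular expressions, by contrast, guess at structure that the text itself does not declare, which is one reason unstructured data is harder to process with conventional tools.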
Why is Unstructured Data Growing so Fast?
The analyst firm IDC predicts that we will generate over 175 zettabytes of data by 2025 (one zettabyte is one billion terabytes, or the capacity of a billion 1-terabyte drives). IDC also predicts that in the next three years we will generate more data than we created over the past 30 years, and that this growth trend will continue.
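The scale of a zettabyte can be checked with simple arithmetic. The sketch below assumes decimal (SI) units, where a terabyte is 10^12 bytes and a zettabyte is 10^21 bytes; drive makers generally use these units when stating capacity.

```python
# Assumption: decimal (SI) units, not binary units (TiB/ZiB).
TERABYTE = 10**12   # bytes
ZETTABYTE = 10**21  # bytes

# Number of 1 TB drives needed to hold one zettabyte.
drives_per_zettabyte = ZETTABYTE // TERABYTE

# Number of 1 TB drives needed to hold the 175 ZB figure quoted from IDC.
drives_for_175_zb = 175 * drives_per_zettabyte

print(f"{drives_per_zettabyte:,}")  # 1,000,000,000
print(f"{drives_for_175_zb:,}")     # 175,000,000,000
```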
Most of the data we generate today is unstructured because unstructured data has several advantages over structured data:
- Wider Use Cases for Unstructured Data: Structured data has a rigid, predefined structure and can only be used for its intended purpose. This narrows its use cases: it works well for transactional applications such as revenue tracking or catalogs, but it is not a good fit for applications whose data is hard to categorize, such as video or genomics.
- Various Formats: Unstructured data can be stored in a variety of formats. An MP4 video, a genomics BAM file, a .log diagnostics file, and an X-ray image stored in a digital PACS archive are all types of unstructured data. An accurate way to describe unstructured data, then, is that it comes in many formats rather than just one. This means more applications can generate unstructured data and tailor the format to their use.
- Various Sizes: Unlike a cell in a database, unstructured data is not restricted to a specific size or character limit. For example, you can have small video files for short snippets and large video files for full-length movies. This also increases flexibility in how unstructured data is generated and used.
Since unstructured data is easier to create and use, more applications and users are working with unstructured data.
Unstructured Data Management
Managing the growing volumes of unstructured data generated within an organization is leading to higher expenses.
What to know about unstructured data:
- Volume: The sheer quantity of data will continue to grow at an incomprehensible rate
- Velocity: Data is arriving at a continually faster rate
- Variety: The types of data continue to become more varied
These 3 Vs of unstructured data, originally defined by former META Group / Gartner industry analyst Doug Laney, mean that managing unstructured data growth is critical for organizations as they find their budgets and resources stretched to their limits.
Unstructured data management requires an understanding of what data is hot and actively used, and what data is cold and rarely accessed. In most enterprises, over 80% of unstructured data becomes cold within a year of creation, yet it continues to be managed on the most expensive storage and to consume expensive backup resources. Analytics-driven management of unstructured data can change this by identifying hot and cold data across storage, keeping hot data on the expensive environments, and offloading cold data to lower-cost passive management.
Unstructured data management should be done without restricting access to the cold data, so that users and applications continue to see and access it exactly as before while the organization saves on cold data storage and backups.
To understand how Komprise enables enterprise IT organizations to analyze, move, and manage unstructured data and save costs on storage, backup, and cloud infrastructure, read the white paper: Komprise Intelligent Data Management Architecture Overview.
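The hot/cold distinction described above can be illustrated with a small Python sketch. This is not how Komprise or any other product performs its analysis. It is a simplified illustration that assumes "cold" means a file has been neither accessed nor modified for a configurable number of days (365 here, a hypothetical threshold), and the function name `classify_by_age` is invented for this example.

```python
import os
import time

COLD_AFTER_DAYS = 365  # assumption: untouched for a year counts as "cold"


def classify_by_age(root, cold_after_days=COLD_AFTER_DAYS, now=None):
    """Return total bytes of hot and cold files found under root."""
    now = time.time() if now is None else now
    cutoff = now - cold_after_days * 86400  # seconds per day
    totals = {"hot": 0, "cold": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Use the most recent of last access and last modification.
            last_used = max(info.st_atime, info.st_mtime)
            tier = "cold" if last_used < cutoff else "hot"
            totals[tier] += info.st_size
    return totals


if __name__ == "__main__":
    summary = classify_by_age(".")
    print(f"hot: {summary['hot']:,} bytes, cold: {summary['cold']:,} bytes")
```

Access times are not always reliable, since some file systems are mounted with options that reduce or disable access-time updates. A report like this is therefore only a first approximation of which data could move to cheaper storage.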