Data Management Glossary
Unstructured Data
Data can be of two broad types: structured data and unstructured data.
- Structured Data: Structured data is data that can be organized by structured categories, such as rows and columns in an Excel spreadsheet or a database. For example, accounting records are structured data because you can organize them by customer, by geography, by product, etc. Structured data is typically stored in a database and can be queried using query languages such as Structured Query Language (SQL). Until about 2000, most data was structured, but since then we have seen an explosion of unstructured data. Today, structured data accounts for less than twenty percent of the world’s data.
- Unstructured Data: Unstructured data is data that doesn’t fit neatly in a traditional database and has no identifiable internal structure. This is the opposite of structured data, which is data stored in a database. Up to 80% of business data is considered unstructured, with this number increasing year over year. Examples of unstructured data are text documents, e-mail messages, photos, audio and video files, CAD / CAM files, genomics sequencing data, medical images, presentations, IoT and machine-generated data, log files, user documents stored across teams and departments, and much more.
Unstructured data usually has no predefined data model and does not map well to relational tables. Text-heavy unstructured data may still contain numbers, dates, and facts, but because these are not held in defined fields, conventional software programs have difficulty identifying and interpreting them.
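The contrast can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The invoices table, its columns, and the sample email text are hypothetical and exist only for illustration: the structured records can be grouped by any predefined category with one SQL statement, while the free text has no columns to query.

```python
import sqlite3

# Hypothetical accounting records; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (customer TEXT, region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("Acme", "EMEA", "Widget", 1200.0),
        ("Acme", "APAC", "Gadget", 800.0),
        ("Globex", "EMEA", "Widget", 450.0),
    ],
)

# Structured data: the schema lets us group by any predefined category.
revenue_by_region = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM invoices GROUP BY region ORDER BY region"
    ).fetchall()
)
print(revenue_by_region)

# Unstructured data: free text has no columns, so answering a question about it
# requires search or parsing rather than a SQL predicate.
email_body = "Hi team, Acme confirmed the EMEA order for about 1,200 this morning."
mentions_acme = "acme" in email_body.lower()
print(mentions_acme)

conn.close()
```

The keyword check on the email is deliberately crude; it finds a mention but cannot reliably recover the customer, region, or amount as separate fields.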
Unstructured Data Types by Industry
Unstructured data is the predominant data type generated by most applications today. From self-driving cars to Internet of Things (IoT) devices, genome sequencers, and video and audio files, most of the data we generate and use is unstructured. Here are some examples by industry:
- Life Sciences: Imaging, genome sequencing, research
- Healthcare: Imaging, PACS, digital pathology
- Media & Entertainment: Post-production, animation, VFX, content delivery
- Government: CAD/CAM, GIS, bodycam surveillance
- Oil and Gas: Seismic data, compliance
- Transportation: Autonomous vehicles
- Financial Services: Claims data, call center recordings
Read: Why Harnessing Unstructured Data is a Top Enterprise Mandate
Read: Getting an Upper Hand on the Unstructured Data Problem
Why is Unstructured Data Growing so Fast?
The analyst firm IDC predicts that we will generate over 175 zettabytes of data by 2025 (one zettabyte is equivalent to one billion 1-terabyte drives!). IDC also predicts that in the next three years we will generate more data than we created over the past 30 years, and that this growth trend will continue.
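As a quick check of the scale involved, the conversion can be worked through in a few lines. This sketch assumes decimal (SI) units, where a terabyte is 10^12 bytes and a zettabyte is 10^21 bytes; the variable names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

# How many 1-terabyte drives hold one zettabyte, and the 175 ZB forecast.
drives_per_zettabyte = ZETTABYTE // TERABYTE
drives_for_forecast = 175 * drives_per_zettabyte

print(f"{drives_per_zettabyte:,} one-terabyte drives per zettabyte")
print(f"{drives_for_forecast:,} one-terabyte drives for 175 ZB")
```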
Most of the data we generate today is unstructured because unstructured data has several advantages over structured data:
- Wider Use Cases for Unstructured Data: Structured data has a rigid predefined structure and can only be used for its intended purpose. This narrows its use cases: it is useful for transactional applications like revenue tracking or catalogs, but it is not a good fit for applications that generate data that is hard to categorize, such as video or genomics.
- Various Formats: Unstructured data can be stored in a variety of formats: an MP4 video, a genomics BAM file, a .log diagnostics file, and an X-ray image held in a PACS (picture archiving and communication system) are all types of unstructured data. An accurate way to describe unstructured data, then, is that it spans many formats rather than one. This means more applications can generate unstructured data and tailor the format to their use.
- Various Sizes: Unlike a cell in a database, unstructured data does not have to conform to a specific size or character limit. For example, you can have small video files for short snippets and large video files for full-length movies. This also increases flexibility in how unstructured data is generated and used.
Since unstructured data is easier to create and use, more applications and users are working with unstructured data.
Unstructured Data Management
Managing the growing volumes of unstructured data generated within an organization is leading to higher expenses.
What to know about unstructured data:
- Volume: The sheer quantity of data will continue to grow at an incomprehensible rate
- Velocity: Data is arriving at a continually faster rate
- Variety: The types of data continue to be more varied
These 3 Vs of unstructured data, originally defined by former META Group / Gartner industry analyst Doug Laney, mean that managing unstructured data growth is critical for organizations as they find their budgets and resources stretched to their limits.
Unstructured data management requires an understanding of what data is hot and actively used, and what data is cold and rarely accessed. In most enterprises, over 80% of unstructured data becomes cold within a year of creation, yet it continues to be managed on the most expensive storage and to consume expensive backup resources. Analytics-driven management of unstructured data can change this by identifying hot and cold data across storage, keeping hot data on expensive environments, and offloading cold data to lower-cost storage where it is passively managed. This should be done without restricting access to the cold data, so users and applications continue to see and access it exactly as before while the organization saves on cold data storage and backups. To understand how Komprise enables enterprise IT organizations to analyze, move, and manage unstructured data and save costs on storage, backup, and cloud infrastructure, read the white paper: Komprise Intelligent Data Management Architecture Overview.
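To make the hot/cold distinction concrete, here is a minimal sketch that classifies files by last-access time. It is not how Komprise or any particular product works; the function names, the one-year threshold, and the demo files are assumptions for illustration. Last-access times can also be unreliable on file systems mounted with access-time updates disabled, so a real analysis would need a more robust signal.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def classify_by_access_age(root, cold_after_days=365, now=None):
    """Split files under root into hot and cold by last-access time (illustrative)."""
    now = time.time() if now is None else now
    cutoff = now - cold_after_days * SECONDS_PER_DAY
    report = {"hot": [], "cold": []}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        bucket = "cold" if stat.st_atime < cutoff else "hot"
        report[bucket].append((str(path), stat.st_size))
    return report


def summarize(report):
    """Count files and bytes per bucket so potential savings can be sized."""
    return {
        bucket: {"files": len(items), "bytes": sum(size for _, size in items)}
        for bucket, items in report.items()
    }


# Demo on a throwaway directory: one fresh file, one back-dated by two years.
with tempfile.TemporaryDirectory() as tmp:
    active = Path(tmp) / "active.log"
    archive = Path(tmp) / "archive.bam"
    active.write_bytes(b"x" * 10)
    archive.write_bytes(b"y" * 1000)
    two_years_ago = time.time() - 2 * 365 * SECONDS_PER_DAY
    os.utime(archive, (two_years_ago, two_years_ago))
    summary = summarize(classify_by_access_age(tmp))

print(summary)
```

The byte totals per bucket are the useful output here: they indicate how much capacity could move to cheaper storage if the cold set were offloaded.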
Unstructured Data Migration Challenges
Migrating unstructured data to the cloud has grown in popularity as a way to save on data storage costs, consolidate data centers, modernize IT infrastructure, and take advantage of cloud-based services such as AI, ML, and analytics. But unstructured data migrations to the cloud come with many challenges, including:
- A global enterprise typically has billions of predominantly small files, each of which carries significant per-file overhead, causing data transfers to be slow.
- Server Message Block (SMB) and NFS protocol workloads, which can include user data, electronic design automation (EDA) files, multimedia files, and corporate shares, are problematic because these protocols require many back-and-forth handshakes, which increase traffic over the network. The SMB protocol in particular is known to have WAN transfer performance challenges, so cloud migrations can take much longer than IT organizations anticipate if not done correctly.
- File protocols are sensitive to high-latency network connections, which are unavoidable in WAN migrations.
- Bandwidth is often limited or not always available, causing cloud NAS migration data transfers to become slow, unreliable and difficult to manage.
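The effect of per-file overhead and latency can be sketched with a rough back-of-the-envelope model. Everything here is an assumption for illustration: the function name, the default of five round trips per file, and the simple additive formula do not describe any specific protocol implementation or migration tool.

```python
def estimate_transfer_hours(
    file_count,
    avg_file_bytes,
    rtt_seconds,
    bandwidth_bits_per_second,
    round_trips_per_file=5,  # assumed protocol chattiness, not a measured value
    parallel_streams=1,
):
    """Rough estimate: per-file latency cost plus time to push the payload."""
    latency_seconds = (
        file_count * round_trips_per_file * rtt_seconds / parallel_streams
    )
    payload_seconds = file_count * avg_file_bytes * 8 / bandwidth_bits_per_second
    return (latency_seconds + payload_seconds) / 3600


# The same 10 GB moved over a 1 Gbps link with 50 ms round-trip time.
small_files_hours = estimate_transfer_hours(
    file_count=1_000_000, avg_file_bytes=10_000,
    rtt_seconds=0.05, bandwidth_bits_per_second=1e9,
)
large_files_hours = estimate_transfer_hours(
    file_count=10, avg_file_bytes=1_000_000_000,
    rtt_seconds=0.05, bandwidth_bits_per_second=1e9,
)

print(f"1,000,000 small files: {small_files_hours:.1f} hours")
print(f"10 large files: {large_files_hours:.3f} hours")
```

Under these assumptions the payload time is identical in both cases, so the difference comes entirely from round trips. Raising parallel_streams shrinks the latency term, which is why parallelism tends to matter more than raw bandwidth for small-file workloads.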
AI Needs Unstructured Data
In a 2022 blog post, the Komprise co-founder and CEO wrote about unstructured data management as the foundation for artificial intelligence (AI) and machine learning (ML) initiatives.
Enterprises need to be ready for this wave of change and it starts by getting unstructured data prepped, as this data is the critical ingredient for AI/ML. This entails new data management strategies which create automated ways to index, segment, curate, tag and move unstructured data continuously to feed AI and ML tools. Unforeseen changes to society, fueled by AI, are coming soon and you don’t want to be caught flat-footed.
What is Unstructured Data?
Unstructured data refers to information that doesn’t have a predefined data model or is not organized in a predefined manner. Unlike structured data, which is typically organized into tables and follows a specific schema, unstructured data lacks a clear and consistent structure. This type of data is often text-heavy but can also include images, videos, audio, social media posts, emails, and other forms of content. About 90% of all data generated in today’s digital age is unstructured. The sheer volume of unstructured data makes it challenging to manage and analyze using traditional methods. There is an estimated 120 ZB of data in the world today, according to Statista. IDC expects data to grow to 175 zettabytes by 2025.
What are Examples of Unstructured Data?
Examples of unstructured data include:
- Text Documents: Word documents, PDFs, emails, and other textual content.
- Multimedia Files: Images, videos, and audio files.
- Social Media Feeds: Posts, comments, and multimedia content from social media platforms.
- Web Pages: Content from websites, which may include text, images, and multimedia elements.
- Sensor Data: Data from sensors, such as those in IoT (Internet of Things) devices.
What is Unstructured Data Management?
Unstructured data management refers to the processes and strategies involved in handling, storing, organizing, and extracting value from unstructured data. Effective unstructured data management is crucial for organizations looking to harness the potential insights and value contained within diverse and voluminous datasets. As technologies and best practices continue to evolve, managing unstructured data becomes an integral part of overall data management strategies.
See the definition for Unstructured Data Management and download the State of Unstructured Data Management report.
Why does AI need Unstructured Data?
According to an IDC report sponsored by Box:
In 2022, 90% of the data generated by organizations was unstructured, and only 10% was structured. That year, organizations globally generated 57,280 exabytes of unstructured data — a volume that is expected to grow by 28% to over 73,000 exabytes in 2023. To put this in perspective, an exabyte is 1 million terabytes, or 1 billion gigabytes. Seventy-three thousand exabytes of unstructured data is equivalent to the amount of data in over 97 trillion sequenced human genomes; it’s also equivalent to the amount of video streamed to 2.7 billion screens 24 hours per day for an entire year.
Success with AI depends upon harnessing this data and feeding the right data at the right time to AI platforms. This is difficult and costly not only because of the tremendous volume of unstructured data, but also because of how it is dispersed across data storage silos in the enterprise. Komprise delivers a Global File Index for granular search and tagging of data across silos. In addition, with Komprise Smart Data Workflows, you can create custom workflows to easily search, find, and tag the exact files you want across all your hybrid cloud storage and create a plan to move the right unstructured data to a data lake or AI tool. Komprise delivers a storage-agnostic, analytics-based unstructured data management platform that automates data workflows for AI.