Data infrastructure is on the brink of transformation. This is a core message outlined by Bessemer Venture Partners in their Data Infrastructure: Roadmap, published earlier this year. The firm has an excellent thesis on how businesses are becoming data-driven and the new startup ecosystem that is emerging to support this imperative. This thesis is interesting because Bessemer has recognized the need to look at data infrastructure as its own category because of its massive market opportunity.
The missing piece in the thesis is the challenge presented by unstructured data in the enterprise. The bulk of data being generated today is unstructured: files and objects such as medical images, video and audio files, IoT files, log files and so on. IDC predicts there will be 175 zettabytes stored worldwide by 2025, of which at least 80% will be unstructured.
The trouble is, all this data doesn’t neatly collect right next to the data analytics compute platform. There is massive data sprawl with data collecting at the edges, in various data centers and across different cloud vendors. The process of searching across these disconnected environments is enervating at best. Therefore, a means to find unstructured data across disparate silos, curate that list and feed it to compute engines for specific analytics-driven use cases becomes a critical task ripe for automation.
So while companies are making strides to become data-driven with structured and semi-structured data, what’s right around the corner is the need to manage and extract value from unstructured data.
Here is the Komprise take on Bessemer’s assessment of driving market trends for data infrastructure:
1. Growth in adoption of cloud software.
Bessemer writes about the momentum behind cloud analytics and cloud data warehouses to support the rapid movement of data to the cloud. Beyond data warehouses, cloud data lakes are on the rise because organizations can put any data into a data lake in its native form without any preprocessing. But this flexibility has also led to data lakes becoming unwieldy data swamps, especially of unstructured data, because this data has no specific schema. The problem is: “How can you make it easy to collect and ingest data – to search, collect, find and extract relevant data from a completely unstructured data lake or several data lakes or silos?” Komprise addresses this problem by automatically indexing file and object data, thereby creating an actionable global file index to easily search, find and use data from data lakes.
So, our statement would be: Growth in adoption of cloud software both for data ingestion and cloud native data services.
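To make the idea of a global file index concrete, here is a minimal sketch in Python. It is an illustration only and not how Komprise is implemented: the `FileRecord`, `build_index` and `search` names are hypothetical, and local directories stand in for the data lakes and silos a real index would span.

```python
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    source: str       # hypothetical label for the silo the file lives in
    path: str
    size: int         # bytes
    modified: float   # modification time, seconds since the epoch
    extension: str    # lowercased, including the leading dot


def build_index(roots):
    """Walk each root directory and collect one metadata record per file.

    `roots` maps a silo label to a directory path. Only metadata is read;
    file contents are never opened.
    """
    index = []
    for source, root in roots.items():
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                try:
                    info = os.stat(full_path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                extension = os.path.splitext(name)[1].lower()
                index.append(
                    FileRecord(source, full_path, info.st_size,
                               info.st_mtime, extension)
                )
    return index


def search(index, extension=None, min_size=0):
    """Return records matching an optional extension and a minimum size."""
    return [
        record for record in index
        if (extension is None or record.extension == extension)
        and record.size >= min_size
    ]
```

Because the index holds only metadata, a query such as `search(index, extension=".wav")` can return a curated list of files from every silo without moving any data first.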
2. Increase in volume of accessible data.
Bessemer writes: “Enterprises now need flexible and seamless connections with various data sources such as databases, SaaS apps, and web applications, spinning up new sources as the number of systems they use to operate their businesses expands in the digital realm.” Data volumes are certainly growing in the cloud, but there is also the edge. We are just at the beginning of data at the edge. As edge data continues to grow, enterprises will no longer find it economical or technically feasible to stream all data to the cloud.
The Komprise architecture brings flexible connections to cloud, data center and edge sources containing unstructured file and object data. With Komprise, you can view, search, tag and create a culled list of just the data you want to analyze and transfer it to the cloud. This is why our data management and data mobility vision includes making data easily accessible and searchable everywhere and also enabling local processing and culling of data when needed. You will not be able to centrally store and process all the data.
So, our statement would be: Increase in volume of accessible data across edge, datacenter and clouds.
3. Data becomes a differentiator.
We agree–and machine learning is a critical underpinning capability. Machine learning relies on unstructured data, so the easier we make extracting the right unstructured data and ingesting it into ML, the faster we can deliver business outcomes. Also, this process is iterative: how can you use a cognitive service such as PII detection or audio sentiment analysis and preserve its outcome by inserting tags that represent that outcome? We are continually optimizing and enriching metadata through tags. The context is not trapped inside different cognitive services or data processing silos because the Komprise framework works across them. This is why global tag management in the Komprise Global File Index ensures that the learnings from any data processing are tagged, indexed and can be leveraged anywhere for future processing. Komprise continually optimizes and enriches data throughout its lifecycle.
So, our statement would be: “Optimized data becomes a differentiator” because raw, dark, inaccessible data is not usable, especially for unstructured data.
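The tagging loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Komprise Global File Index: `TagIndex` and `enrich` are hypothetical names, and `detect_pii` is a toy stand-in (a regular expression for US Social Security number patterns) for a real cognitive service.

```python
import re
from collections import defaultdict

# Toy stand-in for a PII detection service: matches strings shaped like
# US Social Security numbers. A real service would do far more.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect_pii(text):
    """Return True if the text contains an SSN-like string."""
    return bool(SSN_PATTERN.search(text))


class TagIndex:
    """Hypothetical in-memory store of key/value tags per file path."""

    def __init__(self):
        self._tags = defaultdict(dict)  # path -> {tag key: tag value}

    def tag(self, path, key, value):
        self._tags[path][key] = value

    def find(self, key, value):
        """Return the sorted paths whose tag `key` equals `value`."""
        return sorted(
            path for path, tags in self._tags.items()
            if tags.get(key) == value
        )


def enrich(index, documents):
    """Run the detector over each document and store the outcome as a tag.

    `documents` maps a file path to its text content.
    """
    for path, text in documents.items():
        index.tag(path, "pii", "yes" if detect_pii(text) else "no")
```

Once the outcome is stored as a tag, a later job can call `index.find("pii", "yes")` and reuse the result without running the detector again.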
4. Demand for talent and sophistication in leveraging data.
We agree that automation plays a significant role in addressing the growing labor and skills shortage and also supports “citizen science”. How can we evolve from needing specialized data engineers to every functional role being able to leverage data for better outcomes? How can you focus your data scientists on the skilled analysis and not on the blocking and tackling of finding and ingesting data? As one of our customers said, “We bought Komprise because we no longer want our data scientists to be data finders and data movers.” Data growth should not have a deleterious effect on the skilled data professionals hired to model and analyze data. They should not be turned into data administrators.
So, our statement would be: “Shortage of talent pool will lead to greater citizen science and automated data management.”
The Bessemer thesis paints an elegant picture of the data journey from the source to data output. To reflect the journey of unstructured data, the graphic, under “Data Collection, Ingestion and Storage”, might well include Unstructured Data Management software to cover “Search, Collection/Curation, and Mobilization” of unstructured data. As an example, in the figure below, S3 is shown as a data lake. Yet S3 as a data lake does not have any way to optimize the finding, curating, and ingesting of the unstructured data in it. Also, S3 is not monolithic. It can consist of hundreds of buckets across multiple storage tiers and multiple AWS accounts. Nor does the graphic consider that data may be spread across Azure Blob, Google Cloud Storage or on-premises object storage. Also missing are cloud file storage, which is gaining prominence, and hybrid cloud file storage, which is an untapped source of unstructured data.
To summarize, the Bessemer thesis does a great job of capturing how businesses are becoming more data-driven and the startup ecosystem that is developing to support them. Now, with the exponential growth of unstructured data in the enterprise, we will see the startup ecosystem expand to include the management, collection and mobilization of unstructured data, along with increasing adoption of machine learning to process unstructured data for greater business value.