This post covers how we use Elasticsearch to power the Komprise Deep Analytics service and how we are building a massive, secure Global File Index to help our customers manage their data.
Data is having a bit of a moment. We know data is used to track, model, and make decisions for practically every facet of life, so it makes sense that data is critical to managing… wait for it: data. What do we mean? We’re talking about metadata: data about data. This is where Komprise comes in.
Our vision is for enterprises to move from managing storage to managing data.
We help customers by providing an intuitive UI (and also APIs) to create custom queries based on file metadata:
- Where the data is located (e.g., what file server, share, or cloud)
- Who owns it
- When it was created, modified, and accessed
- File name, extension, type, size
- Custom tag – add your own data, like project ID
Using this data you can find the needle in the haystack – or more likely millions of needles across many haystacks – and once you find those needles you can now access, protect, replicate or take other action on a very specific set of files. This is the breakthrough that lets IT make granular decisions about data according to business requirements, rather than just making sure there is enough physical disk space to house it.
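To make the idea of a metadata query concrete, here is a minimal sketch of how the filters listed above could be combined into an Elasticsearch bool query. It only builds the request body and does not contact a cluster. The function name and every field name (`share`, `owner`, `extension`, `size_bytes`, `accessed`, `tags.project_id`) are assumptions for illustration, not the actual Komprise schema or API.

```python
import json
from typing import Any, Optional


def build_file_query(
    share: Optional[str] = None,
    owner: Optional[str] = None,
    extension: Optional[str] = None,
    min_size_bytes: Optional[int] = None,
    not_accessed_since: Optional[str] = None,
    project_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build an Elasticsearch bool query from optional metadata filters.

    All field names are hypothetical; they are not Komprise's real schema.
    """
    filters: list[dict[str, Any]] = []
    if share is not None:
        filters.append({"term": {"share": share}})
    if owner is not None:
        filters.append({"term": {"owner": owner}})
    if extension is not None:
        # Normalize ".DCM" and "dcm" to the same stored form.
        filters.append({"term": {"extension": extension.lower().lstrip(".")}})
    if min_size_bytes is not None:
        filters.append({"range": {"size_bytes": {"gte": min_size_bytes}}})
    if not_accessed_since is not None:
        # Match files last accessed strictly before the given date.
        filters.append({"range": {"accessed": {"lt": not_accessed_since}}})
    if project_id is not None:
        filters.append({"term": {"tags.project_id": project_id}})
    # Clauses under "filter" are ANDed together and are not scored.
    return {"query": {"bool": {"filter": filters}}}


if __name__ == "__main__":
    query = build_file_query(
        owner="jsmith", extension=".DCM", not_accessed_since="2020-01-01"
    )
    print(json.dumps(query, indent=2))
```

Putting the clauses in filter context suits this kind of search because the question is whether a file matches, not how relevant it is.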
Here are a few examples of how our customers use metadata to manage their data:
- Collect specific data sets from multiple sites and clouds and copy to another location for analysis by AI/ML to get value out of data;
- Hunt down data owned by former employees and confine it for deletion to free up resources and comply with regulations;
- Archive clinical data while enabling availability for use in future studies and speeding the development of new therapies.
Our goals:
- Enable customers to gain insights across massive data sets;
- Precisely locate granular file sets to support research needs and decision-making;
- Protect the security of customers’ metadata;
- Maintain a simple architecture for ease of operation and performance.
The Metadata Mandate
Komprise was founded in 2014 and since then has helped customers index and store a staggering amount of metadata – hundreds of billions of records or observations to date.
How does Komprise manage this metadata? We chose the open-source indexing engine Elasticsearch. Elasticsearch indexes, stores, and protects the metadata, and it is the engine behind Komprise’s global metadata index, which lets you zero in on specific data sets across multiple data centers and hybrid cloud.
This approach keeps the Komprise architecture simple. Using a dedicated metadata solution means you can store as much metadata as needed (creating additional tags, for example) without the concern that Komprise’s performance will be impacted. Other solutions try to do everything in a single central database, which restricts scale and suffers performance penalties as the metadata load grows.
How Does Komprise Run and Secure Elasticsearch?
Komprise is by default a SaaS offering, with both the Komprise Director and Elasticsearch infrastructure running in the cloud on behalf of our customers. Komprise manages the security, configuration, patching, and protection of Elasticsearch.
The Observers, deployed as a grid on premises adjacent to the data, stream the metadata to the Deep Analytics index service, where it is indexed in the Elasticsearch cluster. Just like other tasks handled by the Observers, the indexing is distributed over the scale-out grid. The diagram below illustrates how the Komprise Grid analyzes data over multiple data centers or clouds and streams the metadata to Elasticsearch while the Director queries and caches the results.
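As a rough illustration of what streaming metadata into an index can look like, the sketch below serializes a batch of file records into the newline-delimited body that the Elasticsearch `_bulk` API expects. It is not the Observers' actual implementation. The function name, the index name, the record fields, and the choice of a hashed file path as the document ID are all assumptions.

```python
import hashlib
import json
from typing import Iterable


def to_bulk_ndjson(index_name: str, records: Iterable[dict]) -> str:
    """Serialize file-metadata records into an Elasticsearch _bulk body.

    Each record must contain a "path" key (a hypothetical field name).
    """
    lines = []
    for record in records:
        # Assumption: derive the document ID from the file path so that
        # re-scanning the same file overwrites its earlier record.
        doc_id = hashlib.sha256(record["path"].encode("utf-8")).hexdigest()
        # Action line, followed by the document source line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(record))
    # The bulk format requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    batch = [
        {"path": "/share1/scans/a.dcm", "size_bytes": 524288, "owner": "jsmith"},
        {"path": "/share1/scans/b.dcm", "size_bytes": 1048576, "owner": "jsmith"},
    ]
    print(to_bulk_ndjson("files-demo", batch), end="")
```

Batching records this way lets each node in a grid send its own requests independently, which fits the distributed indexing described above.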
Customers use the Komprise console to create and execute queries using Deep Analytics hosted by their dedicated Director running in the cloud. The Director then executes these queries against the Elasticsearch cluster. To secure communications between the Director and Elasticsearch, a secure ID is used to map the Director to the dedicated indexes hosted by Elasticsearch.
For customers that need to retain all data and metadata behind their firewall, Komprise can also run Elasticsearch in their data center. Even with on-prem deployment, we provide a fully managed experience. The nodes running Elasticsearch are deployed from the on-prem Director as VM appliances and managed as an integrated component of the Komprise solution.
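The post does not describe how the secure ID works internally, so the following is only a conceptual sketch of the general idea: an opaque, hard-to-guess identifier that resolves to one Director's dedicated indexes and nothing else. The `IndexRegistry` class, its methods, and the index names are hypothetical and are not part of Komprise or Elasticsearch.

```python
import secrets


class IndexRegistry:
    """Maps an opaque per-Director ID to the index names reserved for it."""

    def __init__(self) -> None:
        self._indexes_by_id: dict[str, list[str]] = {}

    def register_director(self, index_names: list[str]) -> str:
        # 32 random bytes from the OS CSPRNG, URL-safe encoded.
        secure_id = secrets.token_urlsafe(32)
        self._indexes_by_id[secure_id] = list(index_names)
        return secure_id

    def resolve(self, secure_id: str) -> list[str]:
        # An unknown ID resolves to nothing rather than to a default index.
        try:
            return list(self._indexes_by_id[secure_id])
        except KeyError:
            raise PermissionError("unknown Director ID") from None


if __name__ == "__main__":
    registry = IndexRegistry()
    director_id = registry.register_director(["files-customer-a"])
    print(registry.resolve(director_id))
```

In this sketch a query can only be routed to indexes that were registered for the presenting ID, so one Director has no way to name another customer's index.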
Emerging Deep Analytics Use Cases
Today we use the metadata to help customers make decisions about how to “right place” their data across storage resources. Tagging is the next step. Customers can tag data with a project ID for chargeback, or tag X-ray images with demographic information to support clinical studies.
Object storage made the concept of metadata tags mainstream, enabling advanced AI/ML workflows to query and act on specific data sets. This ability is new to the realm of NFS and SMB and we are excited to see how customers put it to use.
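The chargeback use case mentioned above maps naturally onto an Elasticsearch terms aggregation with a sum sub-aggregation: group files by project tag, then total their sizes. The sketch below builds such a request body and parses the matching response shape. The field names `tags.project_id` and `size_bytes` and both function names are assumptions for illustration, and the sample response is made up to show the structure.

```python
from typing import Any


def chargeback_aggregation(
    tag_field: str = "tags.project_id",
    size_field: str = "size_bytes",
    max_projects: int = 100,
) -> dict[str, Any]:
    """Build a request body that totals file sizes per project tag."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "by_project": {
                "terms": {"field": tag_field, "size": max_projects},
                "aggs": {"total_bytes": {"sum": {"field": size_field}}},
            }
        },
    }


def parse_chargeback(response: dict[str, Any]) -> dict[str, float]:
    """Extract {project_id: total_bytes} from an aggregation response."""
    buckets = response["aggregations"]["by_project"]["buckets"]
    return {bucket["key"]: bucket["total_bytes"]["value"] for bucket in buckets}


if __name__ == "__main__":
    # Made-up response in the shape Elasticsearch returns for this request.
    sample = {
        "aggregations": {
            "by_project": {
                "buckets": [
                    {"key": "P-100", "doc_count": 3, "total_bytes": {"value": 3000.0}},
                    {"key": "P-200", "doc_count": 1, "total_bytes": {"value": 512.0}},
                ]
            }
        }
    }
    print(parse_chargeback(sample))
```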
Conclusion
We see an evolution of data management beyond just the storage infrastructure team. Analytics will enable the owners or creators of the data to help decide how their data is stored and leveraged for future value. The ability to collect, store, index and enable search of this metadata in an intuitive manner is the critical component that will move us to data-centric management. Saving money on data management is the first step for most customers. Being able to do more with that data will help them drive real innovation.