Data Management Glossary
Data Lakehouse
The data lakehouse builds on the earlier idea of the data lake, a term coined by James Dixon, co-founder and then-CTO of Pentaho. And while both Amazon and Snowflake had already started using the term “lakehouse,” it wasn’t until Databricks endorsed it in a January 30, 2020 blog post entitled “What is a Data Lakehouse?” that the term received mainstream attention (among data practitioners, at least).
You’ve heard of a Data Lake. You’ve heard of a Data Warehouse. Enter the Data Lakehouse.
A data lakehouse is a modern data architecture that combines the benefits of data lakes and data warehouses. A data lake is a centralized repository that stores vast amounts of raw, unstructured, and semi-structured data, making it ideal for big data analytics and machine learning. A data warehouse, on the other hand, is designed to store structured data that has been organized for querying and analysis.
A data lakehouse builds on key elements of these two approaches by providing a centralized platform for storing and processing large volumes of structured and unstructured data, while supporting real-time data analytics. It allows organizations to store all of their data in one place and perform interactive and ad-hoc analysis at scale, making it easier to derive insights from complex data sets. A data lakehouse typically uses modern (and often open source) technologies such as Apache Spark and Apache Arrow to provide high-performance, scalable data processing.
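To make that pattern concrete, here is a minimal sketch using PySpark. The dataset path and the event_date/event_type columns are hypothetical placeholders; the point is that raw Parquet files in the lake are queried in place with warehouse-style SQL, and Apache Arrow speeds the hand-off to pandas for downstream analysis.

```python
# Minimal lakehouse-style workflow sketch with PySpark.
# The path and column names below are illustrative, not a real dataset.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .getOrCreate()
)

# Enable Apache Arrow for efficient Spark <-> pandas data transfer.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Read raw Parquet files directly from lake storage; in practice this
# would typically be an object-store URI (e.g. an s3:// or abfss:// path).
events = spark.read.parquet("/data/lake/events/")

# Register the raw data as a SQL view and run a warehouse-style
# aggregation over it, with no separate load step into a warehouse.
events.createOrReplaceTempView("events")
daily_counts = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_date, event_type
    ORDER BY event_date
""")

# Arrow accelerates this conversion for ad-hoc analysis in Python.
df = daily_counts.toPandas()
print(df.head())
```

The same view could serve both interactive SQL queries and machine-learning feature preparation, which is the dual role the lakehouse architecture is meant to play.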
Who are the data lakehouse vendors?
There are several vendors that offer data lakehouse solutions, including:
- Amazon Web Services (AWS) with AWS Lake Formation
- Microsoft with Azure Synapse Analytics
- Google with Google BigQuery Omni
- Snowflake
- Databricks
- Cloudera with Cloudera Data Platform
- Oracle with Oracle Autonomous Data Warehouse
- IBM with IBM Cloud Pak for Data
These vendors provide a range of services, from cloud-based data lakehouse solutions to on-premises solutions that can be deployed in an organization’s own data center. The choice of vendor will depend on the specific needs and requirements of the organization, such as the size of the data sets, the required performance and scalability, the level of security and compliance needed, and the overall budget.
Komprise Smart Data Workflows is an automated process for all the steps required to find the right unstructured data across your data storage assets, tag and enrich the data, and send it to external tools such as a data lakehouse for analysis. Komprise makes it easier to find and prepare the right file and object data for analytics, AI, and ML projects.