Data lake vs. data warehouse: The fundamentals
The data lake concept postdates the data warehouse by at least a decade, but neither novelty nor alignment with current data trends is the primary differentiator between the two. Let's take a closer look:
Data lake
Like the bodies of water they're named after, data lake solutions are often both wide and deep. The data lake is a design pattern for a system that functions in large part as a repository, one capable of storing massive volumes of data measured in petabytes or more.
But the most notable feature of data lakes is that they're capable of holding raw, unprocessed data in many formats, whether the data is structured, semi-structured, or unstructured. They don't organize this raw data using tables, but instead rely upon metadata tagging methods and other simple identifiers to keep track of the information they hold.
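To make that concrete, here is a minimal sketch of how a lake built on object storage might track a raw file through metadata tags rather than tables. It assumes the boto3 library and Amazon S3; the bucket name, object key, and tag names are hypothetical.

```python
# Minimal sketch of metadata tagging in a data lake built on Amazon S3.
# Assumes boto3 and a hypothetical bucket named "raw-zone".
import boto3

s3 = boto3.client("s3")

# Land a raw clickstream file as-is, attaching metadata instead of forcing it into a table.
with open("clickstream-2024-06-01.json", "rb") as f:
    s3.put_object(
        Bucket="raw-zone",
        Key="clickstream/2024/06/01/events.json",
        Body=f,
        Metadata={                      # simple key-value tags used to find the data later
            "source-system": "web-app",
            "data-domain": "marketing",
            "schema-status": "raw",
        },
    )

# Later, a catalog tool or engineer can read the tags back without parsing the file itself.
head = s3.head_object(Bucket="raw-zone", Key="clickstream/2024/06/01/events.json")
print(head["Metadata"])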
Some data lakes decouple storage and compute functions—with a downstream platform handling the latter—while others provide both in an integrated solution.
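As a rough illustration of that decoupling, a separate query engine can scan files sitting in object storage without first loading them into storage it manages itself. The sketch below assumes DuckDB with its httpfs extension, AWS credentials available in the environment, and a hypothetical S3 path.

```python
# Sketch of decoupled storage and compute: the lake holds files in S3,
# while a separate engine (DuckDB here) supplies compute on demand.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")   # extension that lets DuckDB read s3:// paths
con.execute("LOAD httpfs;")      # AWS credentials are assumed to be in the environment

# Hypothetical path to raw Parquet files landed in the lake.
result = con.execute(
    "SELECT source_system, COUNT(*) AS events "
    "FROM read_parquet('s3://raw-zone/clickstream/2024/06/*/*.parquet') "
    "GROUP BY source_system"
).fetchall()
print(result)
```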
Data warehouse
The data warehouse also resembles the object it's named after, in that its contents are well organized according to an established schema. Warehouses store large amounts of current and historical data in a structured, uniform format after it's been ingested from various sources. These may include application databases, websites, and devices connected to the internet of things (IoT).
If data is already in a warehouse-compatible, relational form that can be queried with structured query language (SQL), it will require less processing. But that isn't the case for all the data a warehouse must hold. Much of it will have to undergo the extract, transform, and load (ETL) process before end users can access it with business intelligence and analytics tools. Even data that arrives in the warehouse's native format will likely require at least some processing, cleansing, deduplication, or other refinement before it's ready for use.
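As a simplified sketch of what that ETL step can look like, the example below extracts raw records, cleanses and deduplicates them, and loads them into a predefined warehouse table. Python's built-in sqlite3 module stands in for the warehouse; the file, table, and column names are hypothetical.

```python
# Simplified ETL sketch: extract raw records, transform them to fit the
# warehouse's predefined schema, then load them. sqlite3 stands in for
# the warehouse; names are hypothetical.
import csv
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("""
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id   INTEGER PRIMARY KEY,   -- schema is fixed before any data arrives
        customer   TEXT NOT NULL,
        amount_usd REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

# Extract: pull raw rows from a source export.
with open("orders_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: cleanse fields and drop duplicate order IDs.
seen, clean_rows = set(), []
for row in raw_rows:
    order_id = int(row["order_id"])
    if order_id in seen:
        continue                          # deduplication
    seen.add(order_id)
    clean_rows.append((
        order_id,
        row["customer"].strip().title(),  # cleansing
        round(float(row["amount"]), 2),
        row["date"][:10],                 # normalize to YYYY-MM-DD
    ))

# Load: insert the refined rows into the warehouse table.
warehouse.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", clean_rows)
warehouse.commit()
```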
Because of the strict schema on which they are built, many data warehouses can only hold structured relational data—though some more modern varieties have been designed to store semi-structured data. Overall, the data warehouse is a subject-oriented, integrated, and consistent design pattern.
Data warehouse vs. data lake: Which is better?
Neither a data lake nor a data warehouse is distinctly "better" than the other. Each design pattern has its proponents, and some business users will work with the data warehouse more often than the lake, while others will do the opposite. But to best understand where each of these big data solutions might fit into your organization's data strategy, consider what warehouses and lakes do best.
Data warehouses: Straightforward structure
The data warehouse design pattern will always be valuable to business units that work primarily or entirely with structured data. For example, a finance department would benefit from using the data warehouse to make quick queries on past versus present ratios of accounts receivable to accounts payable and analyze balance-sheet patterns over time. Along similar lines, HR could use the data warehouse to store large amounts of historical data regarding open enrollment in its health benefits program.
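As a hedged illustration of the finance example, a query like the one below could compare accounts receivable to accounts payable by quarter against a warehouse table. sqlite3 again stands in for the warehouse, and the table and column names are hypothetical.

```python
# Illustrative warehouse query for the finance example: quarterly ratio of
# accounts receivable to accounts payable. Table and column names are hypothetical.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
rows = warehouse.execute("""
    SELECT fiscal_quarter,
           SUM(accounts_receivable) * 1.0 / SUM(accounts_payable) AS ar_to_ap_ratio
    FROM balance_sheet_snapshots
    GROUP BY fiscal_quarter
    ORDER BY fiscal_quarter
""").fetchall()

for quarter, ratio in rows:
    print(f"{quarter}: AR/AP = {ratio:.2f}")
```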
Warehousing benefits and challenges
In both examples above, users could further segment warehouse data into subcategories using the data mart model. Either way, users benefit from the predefined schema of the data warehouse or its marts, which keeps information tightly organized. Data warehousing is also a concept that business users unfamiliar with the nitty-gritty details of data science or engineering can readily grasp. As such, the data warehouse can make business intelligence more accessible, and more valuable, to these non-expert employees.
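In practice, a data mart is often just a subject-specific slice of the warehouse, and one common way to carve it out is with a view, as in the hedged sketch below (table and column names are hypothetical).

```python
# Sketch of a departmental data mart carved out of the warehouse as a view.
# The finance team queries the mart rather than the full warehouse. Names are hypothetical.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("""
    CREATE VIEW IF NOT EXISTS finance_mart AS
    SELECT fiscal_quarter, accounts_receivable, accounts_payable, net_revenue
    FROM balance_sheet_snapshots
    WHERE business_unit = 'finance'
""")
warehouse.commit()
```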
The strict structure and format of data warehouses also create limitations. In most cases, the data warehouse is restricted not only to structured (and perhaps semi-structured) data but also to a single, uniform format. Warehouse schemas must also be strictly defined before ingestion. Therefore, a warehouse that receives data from a wide variety of sources will require considerable time, compute power, and cost to transform different data types into that uniform format through ETL.
Additionally, the data warehouse model doesn't account for data that lacks a clear, defined use case at the point of ingestion. If a business unit strictly follows warehouse practice, such data might be discarded, even if it could prove useful later.
Data lakes: Vast scope
The value of a data lake begins with the capacity of the long-term data containers (typically object stores) that it uses to hold countless datasets. It's a scalable system that can easily expand or contract based on organizational needs, without major cost or disruption. For example, with a cloud data lake built on Amazon Web Services (AWS) or Microsoft Azure, engineers simply pay for the additional Amazon S3 or Azure Blob Storage capacity they consume, which is relatively inexpensive and requires no hardware to be provisioned in advance.
Pros and cons of data lake architecture
Data lakes' format flexibility makes them ideal for the modern enterprise. Their ability to contain multiple unstructured data formats, including text, image, audio, and video files, as well as file formats ranging from JSON and CSV to Parquet, means that no potentially important data lacks storage space. It also eliminates the need for time-consuming ETL at the point of ingestion: data can be stored as-is and transformed only when it's needed. That makes data lakes ideal for managing event-driven streaming data in real time, as well as for handling the unstructured raw data that is often integral to cutting-edge machine learning (ML) and predictive analytics solutions.
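To make that flexibility concrete, the sketch below reads the same lake through three different file formats and imposes structure only at read time. It assumes pandas (with pyarrow installed for Parquet support); the file paths and column names are hypothetical.

```python
# Sketch of schema-on-read in a data lake: files in different formats are
# stored as-is and only given structure when they're read for analysis.
# Assumes pandas with pyarrow installed for Parquet; paths are hypothetical.
import pandas as pd

events_json = pd.read_json("lake/raw/web_events.json", lines=True)  # semi-structured logs
signups_csv = pd.read_csv("lake/raw/signups.csv")                   # flat exports
orders_parq = pd.read_parquet("lake/raw/orders.parquet")            # columnar files

# Structure is imposed here, at read time, rather than at ingestion.
daily = (
    orders_parq
    .merge(signups_csv, on="customer_id", how="left")
    .groupby("order_date")["amount"]
    .sum()
)
print(daily.head())
print(len(events_json), "raw events available for ML feature extraction")
```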
The width and depth of data lakes can also create challenges. Their lack of schema means they are not necessarily easy to navigate for business users who lack data management skills. This means they must use complementary data analytics software, enlist the guidance of an experienced data scientist, analyst, or engineer, or utilize some combination of both to access and leverage the information they need.
Also, without close data governance and well-established metadata tagging overseen by data analysts and scientists, the open architecture of a data lake might lead to serious disorganization and ultimately limit the usefulness of the information it stores.
The data lakehouse: Paradigm shift or flash in the pan?
Comparing the data warehouse and data lake models also requires examining the data lakehouse concept, which emerged in the late 2010s and early 2020s.
Data lakehouses are based on the architecture, structured processing, and organizational features of data warehouses, combined with the inexpensive storage capabilities and unstructured data compatibility of data lakes. Enthusiasm for data lakehouses' potential hasn't flagged among supporters, who have proclaimed the lakehouse is "here to stay."
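As one concrete, and hedged, illustration of the idea: open table formats are a common way to layer warehouse-style tables over lake storage. The sketch below assumes the open-source deltalake Python package and pandas, with a hypothetical local path standing in for object storage.

```python
# Hedged sketch of the lakehouse idea: an open table format (Delta, here)
# adds warehouse-style tables, schema enforcement, and updates on top of
# cheap lake storage. Assumes the `deltalake` package; paths are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Acme", "Globex", "Initech"],
    "amount_usd": [120.0, 87.5, 300.0],
})

# Write the data to lake storage as a governed, schema-enforced table.
write_deltalake("lake/curated/orders", orders, mode="append")

# Query it back like a warehouse table, straight from the lake files.
table = DeltaTable("lake/curated/orders")
print(table.to_pandas().groupby("customer")["amount_usd"].sum())
```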
The all-in-one appeal of data lakehouses sounds more than reasonable on paper. However, real-world implementation of data lakehouse architecture is still uncommon, even among large enterprises that could theoretically support it. Lakehouse adoption also requires organizations to use specific tools that may be incompatible with other elements of their data analysis and management ecosystem. It's too early to tell how—or even if—the lakehouse trend will significantly change the field of data management.
Use data warehouses and data lakes together for the best results
Data warehouses structure and package data for quality, consistency, reuse, and high-concurrency performance. They act as serving and compliance environments that allow business analysts and other key users to leverage data as effectively as possible. Data lakes are designed to preserve raw data in its original fidelity, provide long-term storage at low cost, and enable a new kind of analytical agility. This makes them better suited to staging and processing layers.
There are clear differences between these design patterns and their functions. Data teams can and should enable business users to work with both data lakes and data warehouses as their needs require. Doing so reflects modern data management best practices.
In enterprises, use of warehouses versus lakes varies among departments. For example, an organization's account management team might use a customer relationship management (CRM) application—and its relational database—as a primary data source alongside only a few other systems. The data warehouse might be all this business unit needs.
By contrast, the DevOps team, currently working on an artificial intelligence (AI)-driven product simulation app, must rely on the data lake for the complex, unstructured data types fueling the project. From the organization's big-picture perspective, easy access to both kinds of data is critical, which makes using the data warehouse and the data lake together a necessity.
Teradata VantageCloud can help any organization leverage both data lake and data warehouse architectures. With updated business analytics features, seamless data integration capabilities, and cloud-native deployment, it's the ideal connected data management and analytics platform for today's complex, data-driven business world.