What Is a Healthcare Data Lake and Why Do You Need One? Imagine a Supermarket
In previous articles, we’ve compared a data lake to a data warehouse, and described early- and late-binding architectures. We’ve likened the data lake to a source mart filled with unstructured data, explained the time-to-value of a Late-Binding™ Data Warehouse, and argued which options are best for healthcare.
This article dives deeper into these analytic architectures, using a supermarket analogy to better understand data lakes and how health systems can successfully leverage their unique capabilities with existing analytic platforms.
Understanding the Healthcare Data Lake: Imagine a Supermarket
A data lake, as the name implies, is an open reservoir for the vast amount of data inherent in healthcare. A common misperception is that a data lake is a data warehouse replacement. On the contrary, a data lake is a very useful part of an early-binding data warehouse, a late-binding data warehouse, and a Hadoop system. People disagree on the exact differences between data warehouses and Hadoop systems, but in general, data warehouses use relational databases, have strict schemas, and run on a small number of expensive servers, while Hadoop systems use both relational and non-relational data, are more relaxed with schemas, and use lots of cheap servers.
The data lake allows data warehouses and Hadoop systems to receive and store various types of data while applying structure and meaning to it as specific use cases arise. For example, using a data lake, a health system can store all the medications prescribed to each patient. In the future, when that health system wants to determine if a patient was ever prescribed opioids, it can do so by applying structure to the medication list stored in the data lake. Herein lies the beauty of the data lake: because structure doesn’t have to be applied up front, health systems don’t have to know what they’re going to do with the data until specific use cases arise, such as the opioid example, saving them time, resources, and money.
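The opioid example can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the record layout, the patient IDs, and the short opioid list are all assumptions made up for the example. The point is that the raw records are stored as delivered, and structure is applied only when the opioid question is asked.

```python
import json

# Raw medication records, stored exactly as a (hypothetical) source system sent them.
RAW_MEDICATION_LAKE = [
    '{"patient_id": "P001", "drug": "Lisinopril 10 mg tablet", "date": "2015-03-02"}',
    '{"patient_id": "P001", "drug": "Oxycodone 5 mg tablet", "date": "2016-07-19"}',
    '{"patient_id": "P002", "drug": "Metformin 500 mg tablet", "date": "2016-01-11"}',
]

# Illustrative and deliberately incomplete list of opioid ingredient names.
OPIOID_INGREDIENTS = {"oxycodone", "hydrocodone", "morphine", "fentanyl", "codeine"}


def was_ever_prescribed_opioid(patient_id, raw_records):
    """Apply structure at read time: parse each raw record, then test the drug name."""
    for line in raw_records:
        record = json.loads(line)
        if record["patient_id"] != patient_id:
            continue
        # Assumption: the first word of the drug string is the ingredient name.
        ingredient = record["drug"].split()[0].lower()
        if ingredient in OPIOID_INGREDIENTS:
            return True
    return False


print(was_ever_prescribed_opioid("P001", RAW_MEDICATION_LAKE))
print(was_ever_prescribed_opioid("P002", RAW_MEDICATION_LAKE))
```

Nothing about opioids had to be decided when the records were stored; the classification lives entirely in the function written when the question came up.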
To understand the data lake and its association with data warehouses and Hadoop systems, imagine a supermarket with a receiving dock where distributors deliver their products. The products are organized by how the manufacturer sends them, so a PepsiCo pallet has Pepsi bottles, Aquafina water, and Doritos chips (all PepsiCo products). This is analogous to a data lake: the delivered data keeps the structure of the source from which it originated. (Note that data such as web logs, forms, and notes is also structured data, just stored in text form. It can be converted to structured form by parsing it with either pattern matching or natural language processing.)
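As a small illustration of the pattern-matching route, the sketch below pulls medication mentions out of a free-text note with a regular expression. The pattern, the sample note, and the function name are hypothetical; real clinical text is far messier and usually needs a much richer pattern set or natural language processing.

```python
import re

# Hypothetical pattern: a drug name, a numeric dose, and a unit (mg, mcg, or g).
MEDICATION_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)\b",
    re.IGNORECASE,
)


def extract_medications(note):
    """Convert text-form data into structured rows via pattern matching."""
    return [
        {
            "drug": match.group("drug").lower(),
            "dose": float(match.group("dose")),
            "unit": match.group("unit").lower(),
        }
        for match in MEDICATION_PATTERN.finditer(note)
    ]


note = "Pt reports pain. Started oxycodone 5 mg q6h; continue lisinopril 10mg daily."
for row in extract_medications(note):
    print(row)
```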
Let’s examine three analytic architectures by associating them with different ways to get products from the pallet to supermarket shoppers.
How a Data Lake Works with Three Common Analytic Architectures
Continuing our supermarket analogy, the loading dock (data lake) is stocked with PepsiCo products (data structured in the source format). There are three ways to get those products onto shelves and into the hands of shoppers:
#1. The Early-Binding Data Warehouse
Shoppers, of course, want to see all soft drinks in one place and all chips in another. So, someone opens the pallet and stacks Pepsi bottles on the shelves with the soft drinks, Aquafina with the bottled water, and Doritos with the chips. This is analogous to a traditional early-binding data warehouse where all the data must be moved and organized before it can be consumed. This model only works when there are enough time and resources to organize everything at the receiving dock. In practice, an early-binding data warehouse requires a lot of work before it returns any value. While this is standard operating procedure for a supermarket, it isn’t a good fit for healthcare data.
With insufficient time and resources to organize all the products, the alternative, when someone asks for a Pepsi bottle, is to return to the receiving dock, find the Pepsi bottle, and take it to the customer. This avoids the upfront work to categorize everything, but it creates more work because someone must search through everything on the receiving dock.
The early-binding data warehouse is time and resource intensive. Fortunately, there are two better approaches: the late-binding data warehouse and the Map-Reduce Hadoop system.
#2. The Late-Binding Data Warehouse
A more efficient approach than early binding is moving the Pepsi bottles (or the Aquafina or Doritos) from the receiving dock to the store shelves as the store discovers that a lot of customers are asking for Pepsi. This is how to integrate a data lake with a late-binding data warehouse. The benefit is that only the required data is organized instead of all the data in the data lake. This is also called “schema-on-read” or “late binding” because structure and meaning are provided to the data (Pepsi bottles are organized in a group) only when read (as customers ask).
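The contrast between the two binding styles can be sketched in Python using the supermarket itself. This is only an illustration under simplifying assumptions: the dock contents and the names bind_early and LateBoundStore are invented for the example, and the late-bound store shelves a product on the first request rather than waiting to see sustained demand.

```python
from collections import defaultdict

# Receiving dock (data lake): items stay grouped by supplier pallet, as delivered.
RECEIVING_DOCK = [
    {"supplier": "PepsiCo", "product": "Pepsi", "category": "soft drinks"},
    {"supplier": "PepsiCo", "product": "Aquafina", "category": "bottled water"},
    {"supplier": "PepsiCo", "product": "Doritos", "category": "chips"},
    {"supplier": "PepsiCo", "product": "Pepsi", "category": "soft drinks"},
]


def bind_early(dock):
    """Early binding: organize every item by category before anyone asks."""
    shelves = defaultdict(list)
    for item in dock:
        shelves[item["category"]].append(item["product"])
    return dict(shelves)


class LateBoundStore:
    """Late binding: shelve a product only once a shopper asks for it."""

    def __init__(self, dock):
        self.dock = dock
        self.shelves = {}

    def request(self, product):
        if product not in self.shelves:
            # Structure is applied here, at read time, for this product only.
            self.shelves[product] = [
                item["product"] for item in self.dock if item["product"] == product
            ]
        return self.shelves[product]


store = LateBoundStore(RECEIVING_DOCK)
print(store.request("Pepsi"))
print(list(store.shelves))
```

After the single request, only Pepsi has been organized; Aquafina and Doritos stay on the dock until someone needs them.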
One other architecture, the Map-Reduce Hadoop System, is faster still, although considerably more resource-intensive to implement and sustain.
#3. The Map-Reduce Hadoop System
The limitation with the late-binding approach is the one person running back and forth to the receiving dock, so what if we hire three more people and divide the dock into four separate sections: North, South, East, and West? Now each person can look in their single section for Pepsi bottles. If the sections are roughly equal in size, the search takes about a quarter of the time. Fundamentally, this is the concept of “Map-Reduce” in Hadoop systems, which divide problems into smaller parts, give them to different computers to solve, and then collect the results.
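The four-worker dock can be sketched as follows. This is a toy illustration of the divide, solve, and collect idea in plain Python, not Hadoop itself: the section contents and function names are made up, and threads on one machine stand in for the separate computers a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dock contents, split into the four sections from the analogy.
DOCK_SECTIONS = {
    "North": ["Pepsi", "Doritos", "Aquafina"],
    "South": ["Doritos", "Pepsi", "Pepsi"],
    "East": ["Aquafina", "Aquafina"],
    "West": ["Pepsi", "Doritos"],
}


def count_in_section(section_items, product):
    """Map step: one worker counts the product in a single dock section."""
    return sum(1 for item in section_items if item == product)


def count_product(sections, product):
    """Run the map step once per section in parallel, then reduce by summing."""
    with ThreadPoolExecutor(max_workers=len(sections)) as pool:
        partial_counts = list(
            pool.map(lambda items: count_in_section(items, product), sections.values())
        )
    return sum(partial_counts)


print(count_product(DOCK_SECTIONS, "Pepsi"))
```

Each worker only ever sees its own section, and the final answer comes from combining the four partial counts.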
Once a health system understands how a data lake works and the analytic value it adds, it needs to determine how to integrate the lake into its existing analytics platform.
Health System Resources and Infrastructure Determine Optimal Data Lake Approach
There is a lot of data in healthcare, but not enough time and resources to map it. A data lake brings value to healthcare because it stores all the data in a central repository and only maps it as needs arise. Determining how to structure data before it’s brought in, although common in healthcare, wastes time, money, and resources. At the time data is brought into the data lake, it’s impossible to know how best to structure it, because not all of the use cases for that data are known yet. Using the data lake approach of bringing data in and then adding structure as use cases arise is the right thing to do in healthcare to avoid multi-year projects that ultimately fail.
When it comes to integrating a data lake with the three common analytic architectures described in this article, the supermarket analogy demonstrates why an early-binding data warehouse is not a good fit for healthcare data: it requires a lot of time to map the data before realizing value. Health systems can choose between a late-binding data warehouse and a Hadoop system by assessing the infrastructure and resources available and the amount of data in their systems. Typically, when a health system has fewer than two million patients and either has no existing Hadoop infrastructure or lacks the resources to manage one, a late-binding data warehouse is the better, more practical choice.
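The rule of thumb above can be written down as a small helper. This is only a sketch of the heuristic as stated: the function and parameter names are hypothetical, and the fallback for health systems outside the stated case is our assumption, since the rule only says when late binding is the more practical choice.

```python
LATE_BINDING = "late-binding data warehouse"
CONSIDER_HADOOP = "consider a Hadoop system"

PATIENT_THRESHOLD = 2_000_000  # "two million patients" from the rule of thumb


def suggest_architecture(patient_count, has_hadoop_infrastructure, can_manage_hadoop):
    """Return a starting suggestion, not a substitute for a real assessment."""
    small_population = patient_count < PATIENT_THRESHOLD
    hadoop_ready = has_hadoop_infrastructure and can_manage_hadoop
    if small_population and not hadoop_ready:
        return LATE_BINDING
    # Assumption: outside the stated case, Hadoop is at least worth evaluating.
    return CONSIDER_HADOOP


print(suggest_architecture(800_000, has_hadoop_infrastructure=False, can_manage_hadoop=False))
print(suggest_architecture(5_000_000, has_hadoop_infrastructure=True, can_manage_hadoop=True))
```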
Would you like to learn more about this topic? Here are some articles we suggest:
- Data Lake vs. Data Warehouse: Which Is Right for Healthcare?
- Early- or Late-Binding Approaches to Healthcare: Which Is Better for You?
- Data Warehouse Tools: Faster Time-to-Value for Your Healthcare Data Warehouse
- 6 Reasons Why Healthcare Data Warehouses Fail
- 10 Trends in Healthcare Data Warehousing That Every Health System Needs to Know