Data Lake vs. Data Warehouse: Which is Right for Healthcare?

In 2010, James Dixon introduced the concept of the Data Lake, and his idea has gained traction ever since. Dixon’s Data Lake is a style of data warehouse architecture, which he describes as follows:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

He conceived of this architecture as a flexible alternative to traditional data warehouses that keep data in a very structured format. Although structured data is easier to consume, structure puts constraints on the analyses that can be performed. Structured data enables us to answer certain questions, but the structure may not accommodate questions that arise at some point in the future. The Data Lake concept allows for unstructured data—and more flexibility to answer new questions.

The Late-binding EDW and the Data Lake

At Health Catalyst, we see a lot of value in this idea of the Data Lake. In fact, Dixon’s Data Lake concept is very similar to our Late-Binding™ enterprise data warehouse (EDW) architecture. What Dixon calls a Data Lake, we call a source mart. We bring the data from our source systems into these source marts. It’s best practice to keep the data as raw as possible in the source marts, relying on the natural data models of the source systems. The late-binding model does, however, perform minimal data conformance so that data from different source marts can work together, for example by making sure that the “patient name” field in one source mart is structured the same way as “patient name” in another. As much as possible, late-binding methods defer remodeling the data in the source marts until an analytic use case requires it. That’s how this approach takes the “water” in its natural state into the EDW. Then, when someone needs a report, an analyst packages the relevant data from the source marts into a subject area mart and performs the analysis there.
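
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Health Catalyst's implementation: the two source marts, their field names, the `conform_name` helper, and the `build_diabetes_charges_mart` function are all hypothetical. The raw rows keep each source system's own shape, only the patient name is conformed, and the subject area mart is shaped at the moment a specific question is asked.

```python
# Hypothetical raw extracts. Each source mart keeps its source system's own field names.
ehr_source_mart = [
    {"mrn": "1001", "patient_name": "DOE, JANE", "dx_code": "E11.9"},
    {"mrn": "1002", "patient_name": "ROE, RICHARD", "dx_code": "I10"},
]
billing_source_mart = [
    {"account": "A-77", "mrn": "1001", "patient name": "Jane Doe", "charge": 250.0},
    {"account": "A-78", "mrn": "1002", "patient name": "Richard Roe", "charge": 90.0},
    {"account": "A-79", "mrn": "1001", "patient name": "Jane Doe", "charge": 40.0},
]


def conform_name(raw: str) -> str:
    """Minimal conformance: turn 'LAST, FIRST' or 'First Last' into 'Last, First'."""
    raw = raw.strip()
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
    else:
        first, _, last = raw.rpartition(" ")
    return f"{last.title()}, {first.title()}"


def build_diabetes_charges_mart(ehr, billing):
    """Late binding: the subject area shape is decided here, at analysis time."""
    # Patients with a type 2 diabetes code (ICD-10 E11.x), keyed by record number.
    diabetic = {
        row["mrn"]: conform_name(row["patient_name"])
        for row in ehr
        if row["dx_code"].startswith("E11")
    }
    # Sum billing charges for those patients only.
    totals = {}
    for row in billing:
        if row["mrn"] in diabetic:
            totals[row["mrn"]] = totals.get(row["mrn"], 0.0) + row["charge"]
    return [
        {"mrn": mrn, "patient_name": diabetic[mrn], "total_charges": total}
        for mrn, total in sorted(totals.items())
    ]


subject_area_mart = build_diabetes_charges_mart(ehr_source_mart, billing_source_mart)
```

Because the source marts are left untouched, a different question later on (say, charges for hypertension patients) would only need a new packaging function, not a new data model.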

Key Benefits of a Flexible, Late-binding Approach

We’ve used this architecture since our inception because it is the most efficient method for quickly harnessing data to support a particular improvement project. Here are three key benefits that make this the best architecture for healthcare EDWs:

  1. Analysts only have to go to each source system once. Once a process is set up to extract data from a source system, no one has to touch that source system again for analytics purposes.
  2. Health system leaders cannot foresee all of their organizations' future analytics needs. Analytic needs in healthcare are fluid, with standards and vocabularies evolving rapidly, so new questions will keep arising that leaders will want data to answer. A flexible architecture lets analysts respond to those needs as they emerge. Like the Data Lake, the late-binding architecture allows for this.
  3. Users can scale the size of an EDW easily with this architecture using traditional Microsoft database tools. The team can start small and lean, pulling into the source marts only the data needed to address a specific use case, and then add more data as capacity allows.
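
The first two benefits can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `SourceSystem` and `Warehouse` classes, the lab rows, and the thresholds are hypothetical stand-ins, not real products or data. The point is that one extraction of raw rows can serve both today's question and a later, unplanned one.

```python
class SourceSystem:
    """Stand-in for an operational system; counts how often it is queried."""

    def __init__(self, rows):
        self._rows = rows
        self.extract_count = 0

    def extract(self):
        self.extract_count += 1
        return [dict(row) for row in self._rows]


class Warehouse:
    """Holds source marts as raw rows, keyed by name."""

    def __init__(self):
        self.source_marts = {}

    def load_source_mart(self, name, system):
        # Keep every raw field, even ones no current report needs.
        self.source_marts[name] = system.extract()

    def query(self, name, predicate):
        return [row for row in self.source_marts[name] if predicate(row)]


lab_system = SourceSystem([
    {"mrn": "1001", "test": "HbA1c", "value": 8.2, "unit": "%"},
    {"mrn": "1002", "test": "HbA1c", "value": 5.6, "unit": "%"},
    {"mrn": "1001", "test": "LDL", "value": 130, "unit": "mg/dL"},
])

edw = Warehouse()
edw.load_source_mart("lab", lab_system)  # the only trip to the source system

# Today's question: which HbA1c results are at or above 7.0?
high_a1c = edw.query("lab", lambda r: r["test"] == "HbA1c" and r["value"] >= 7.0)

# A later, unplanned question, answered from the same raw rows.
ldl_results = edw.query("lab", lambda r: r["test"] == "LDL")
```

After both queries, `lab_system.extract_count` is still 1. Starting small then means loading one source mart for one use case and calling `load_source_mart` again only when a new source is actually needed.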

How to Drink the Data Lake’s Water

In a discussion of a Data Lake (or any kind of data warehouse architecture), the key question is this: How are users going to use the water? It doesn’t matter how deep the Data Lake is if an organization hasn’t figured out how to use it to drive real improvement. It is of no use to anyone if physicians are standing around, thirsty for data but unable to access it. Instead, the data should be put to good use.

Health Catalyst offers three applications to increase the efficiency and effectiveness of the Data Lake:

  1. Metadata: Our metadata helps users understand incrementally what data is in the lake. In particular, it lets analysts know where each piece of data in the lake came from.
  2. Source Mart Designer: This application provides an efficient way to get data into the lake from the source systems and keep it refreshed.
  3. Subject Area Mart Designer: With this, users can easily pull data out of the lake and package it up into a consumable, useful format to solve a real analytical use case.
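
To make the metadata idea concrete, here is a minimal sketch of a lineage catalog in Python. It does not represent the Health Catalyst metadata product; the `Lineage` record, the `catalog` entries, and the table and column names are all hypothetical. It only shows the kind of question such metadata answers: where did this piece of data in the lake come from?

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    """Where a source mart column originated (all names are illustrative)."""
    source_system: str
    source_table: str
    source_column: str


# Hypothetical catalog keyed by (source mart, column).
catalog = {
    ("ehr", "patient_name"): Lineage("EHR", "PATIENT", "PAT_NAME"),
    ("ehr", "dx_code"): Lineage("EHR", "ENCOUNTER_DX", "ICD10_CODE"),
    ("billing", "charge"): Lineage("Billing", "TRANSACTIONS", "AMOUNT"),
}


def where_from(mart: str, column: str) -> str:
    """Describe the origin of one column, or say that none is recorded."""
    entry = catalog.get((mart, column))
    if entry is None:
        return f"{mart}.{column}: no lineage recorded"
    return (
        f"{mart}.{column} <- "
        f"{entry.source_system}.{entry.source_table}.{entry.source_column}"
    )


def columns_in(mart: str) -> list[str]:
    """List the columns the catalog knows about for one source mart."""
    return sorted(col for (m, col) in catalog if m == mart)
```

For example, `where_from("ehr", "dx_code")` reports the originating system, table, and column, while `columns_in("ehr")` lets an analyst browse what has been loaded so far.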

Health system leaders can use the Data Lake to improve clinical quality or to drive operational efficiency. It can also be used to manage an accountable care organization. Find an analytics partner with proven experience driving improvement and a flexible EDW architecture, and slake the thirst of physicians.

