The Best Data Architecture: Know When to Bind Your Healthcare Data
I often talk about why a Late-Binding™ data architecture is the best option for an enterprise data warehouse (EDW) with healthcare data. Why am I so passionate about this topic? Because knowing when and how tightly to bind data to rules and vocabularies is critical to the success—or failure—of an EDW.
Previously, I described the strengths and weaknesses of the three most common data architectures used for EDWs in healthcare. These are:
- The enterprise data model popularized by Bill Inmon and Claudia Imhoff
- The summary data mart (often implemented using Dimensional data models, or the star schema), popularized by Ralph Kimball
- The adaptive, or Late-Binding,™ data model advocated by Health Catalyst
One of the significant differentiators between these three models is how early or late they require you to bind data. On the continuum of early- to Late-Binding™, an enterprise data model binds the earliest. A star schema is early binding but somewhat more flexible than an enterprise data model. And the Late-Binding™ model is the most agile of the three.
Early-binding models have worked very well in industries such as retail or banking where business rules and vocabularies are stable and predictable. But the healthcare industry is unique. Business rules and vocabulary standards in healthcare are complex and change constantly.
Binding data too tightly—too early—to volatile rules and standards puts the long-term viability of your EDW at risk. That is why flexibility in terms of when to bind data can determine the success of your EDW. And that is why an agile, Late-Binding™ data model is the superior architecture for a healthcare EDW.
Data-binding Points in the Late-Binding™ Data Model
I’d like to illustrate what it means to bind data late. Data-binding decisions involve determining the point at which we want to bind pieces of data to a specific business rule or vocabulary.
It’s important to note that the purpose of the Late-Binding™ model isn’t to bind everything late. In fact, there are times when it is appropriate to bind early. Rules and vocabularies that change infrequently—like patient or provider identifiers—should be bound early. But rules and vocabularies that are more volatile—an evolving standard or an area where no standard yet exists—should be bound later. The point of a Late-Binding™ model is to bind as late as possible, waiting until a clear analytics use case calls for it.
There are six points in a data warehouse at which data can be bound to rules and vocabularies (illustrated below).
Before I discuss these six points, I want to define the four columns in this diagram:
- The first column represents all the different source systems feeding data into your EDW. These might include internal sources like an EMR, a financial system, a patient satisfaction system and your HR system. They also might include external sources, such as academic or government databases.
- The second column represents source marts—important features of a Late-Binding™ EDW. We move data from the source system to these source marts but perform minimal transformation in the process. The source marts serve as a kind of staging area for analytics. In them, the data retains the granularity and characteristics it had in the source systems. At the same time, the data is linked to a “data bus” of a small number of core data elements that are common to almost all analytics use cases in healthcare (patient ID, provider ID, date and time, facility ID, etc.). The data bus consists, essentially, of early-binding decisions. Binding source mart data to these stable, core data elements enables you to query across disparate source system content in the data warehouse.
- The third column represents customized data marts, which we refer to as subject area marts (SAMs). We create a SAM to address a particular analytics use case. For example, we might create a SAM that is a diabetes registry. Typically, a greater amount of data transformation and binding occurs as we move data from the source marts into these data marts to address a specific use case.
- The fourth column represents the visualization layer. This layer consists of the actual reports—in a variety of formats—generated by your analytics system.
As you can see in the diagram, you can bind at six places, either in or between these columns.
The Cost-per-case Example
I’ll use the concept of cost-per-case to illustrate how these data-binding points work. In this example, you decide that you want one way to calculate how much each case costs—a single formula to define cost-per-case.
Now, I’ll concede up front that cost-per-case is a metric that most likely should be bound earlier than later. You don’t want Hospital A to have one formula for cost-per-case while Hospital B has a different method of calculating this metric. But, for the sake of illustration, I’ll carry this example all the way through the six binding points.
Here are the binding options for cost-per-case in a Late-Binding™ architecture:
- Bind in the source system: In this scenario, you decide that you want to capture cost-per-case in the source system itself—to enforce that rule and bind it to data in the source system. To bind at this level, you would actually go into your financial system (like PeopleSoft) and create a new field for cost-per-case.
- Bind during ETL to the source mart: You may decide that you don’t want to create a new field in the source system. Rather, you choose to bind the data while extracting and moving it from the financial system into its source mart. You take the case type, the cost, the number of cases, and all of the inputs needed to calculate cost-per-case. From those inputs, you calculate cost-per-case during the ETL process, and then you store that new cost-per-case element in the financial system’s source mart. Importantly, when binding at this point, you do not store the inputs in the source mart, just the resulting calculation.
- Bind in the source mart: Binding at this point is similar to number 2 above. The difference is that you store the calculation and the inputs in the source mart. You preserve the discrete inputs in the source mart and do not have to go back into the source system to find them.
- Bind during ETL to the customized data mart (SAM): Choosing to bind at this point means not storing the calculated cost-per-case in the source mart. Rather, you store the inputs in the source mart and then perform the calculation as the data is extracted and moved into a SAM. At this point, you would store the calculated cost-per-case—not the inputs—in the SAM.
- Bind in the SAM: With this approach, you store all the inputs and the calculated cost-per-case in the SAM. As I mentioned previously, with the cost-per-case example you wouldn’t want to bind so late in the process. You would open yourself up to the risk of ending up with different calculations from different analysts.
- Bind in the visualization layer: In this scenario, you only bind in the report itself. You bring in all of the inputs as you build the cost-per-case report, and you calculate cost-per-case on the fly. In the real world, you wouldn’t bind this late for cost-per-case. Binding in the visualization layer is appropriate for “what if” scenario analysis associated with modeling different reimbursement models or defining disease states. Once that exploratory “what if” phase is complete, the new models and definitions can bound at an earlier point.
The Late-Binding™ model brings so much flexibility to a healthcare EDW. See a new way to look at healthcare data models (complete with a video explaining the concept) or learn the nuanced differences between a database and a data warehouse.
What do you think about early vs. Late-Binding™ for healthcare data? Can you see the advantages of an agile system?
Powerpoint Slide Version
Would you like to use or share these concepts? Download this Six Points to Bind Data presentation highlighting the key main points.