Star Schema vs. Late-Binding™: Best Approach for a Healthcare Data Warehouse
Most discussions about data warehouse approaches in healthcare naturally lead to the question, “Why not just use a star schema (or some other early-binding approach)?” After all, the star schema is one of the most commonly deployed data warehouse methodologies. Its simple, straightforward design is considered a best practice for a wide variety of industries, including manufacturing, retail, telecommunications, and financial services.
When you look at Health Catalyst’s website, or any of our technical literature, one of the most prominent differentiators you’ll see is our approach to the healthcare enterprise data warehouse (EDW). The foundation of our technology is a Late-Binding™ Data Warehouse platform designed specifically for the analytics needs of healthcare providers.
The short answer revolves around the number of variables in healthcare data, the flexibility required to mine that data, and the expense/time to value of creating and delivering the EDW. Let’s look at these factors in more detail.
The Advantages of a Star Schema Data Warehouse in Other Industries
A star schema works well when you are doing counts or aggregation of counts by various (but fixed) dimensions of data. For example, if you want to know how many pairs of a particular brand and style of women’s shoes in size seven and the color blue were sold in the Northeast region in Q3, a star schema works well. It’s easy to provision a small data mart quickly to support the needs of different subject areas.
Another positive to a star schema is that, since analytics have been used in manufacturing, retail, and financial services for so many years, a wide range of tools have been optimized to leverage that model – to the extent that you almost feel forced to go that route. So why didn’t Health Catalyst follow that model?
A Star Schema Healthcare Data Warehouse
The answer is simple: healthcare data is too complex, with too many variables, to make a star schema (or any early-binding model) practical. Compare the shoe sales example above to the data surrounding the vital signs of a patient. When you take blood pressure, you need to know more than what the reading was; you also need to know:
- When the blood pressure was taken (date and time)
- What type of device was used
- Whether the patient was standing, sitting or lying down
- Whether the reading was taken from the left arm, right arm or some other location on the body
As a result, a simple concept such as taking vital signs has a multitude of conditional data points associated with it, making it too complex for a basic star schema to manage effectively with its one-to-many construct.
In a healthcare setting, you’re really dealing with many-to-many relationships. You typically will have multiple patients who have multiple encounters with multiple caregivers, multiple diagnoses, multiple procedures, and multiple results – and you have to resolve all of those relationships at the database level in order to draw useful analytics from it. To make that happen with a star schema you can try using a helper table, but that’s really a Band-Aid approach that doesn’t get to the root of the issue.
Star Schema vs. Late-Binding™ in Healthcare
The better option is a Late-Binding™ Data Warehouse built on top of a relational database structure. Unlike a star schema, which requires significant transformation to move the data out of the source application and make it useful for analytics, a Late-Binding™ Data Warehouse allows you to get data out of the transactional system at the most detailed, lowest level of granularity with the minimal transformation.
Much of the concern around when to bind the data is tied to accuracy – a critical element in any analytics program, but especially in healthcare where quality of life, and maybe even life itself, is on the line. The more transformation you have to do to get the data from the source system to the data warehouse/analytical environment, the greater chance you have of introducing errors into the data or covering up issues in the data capture process. More transformation also means it will be more expensive and time-consuming to map and change the data to get it into a usable format, which translate into very protracted project timelines.
With a star schema (or any early-binding method for that matter), you have to make binding decisions before you load the data into the target analytical repository. Often that occurs before you really understand the binding rules you’ll need. In addition, you have to transform all the data at once because it looks nothing like the applications that were used to capture the data, as opposed to building the data warehouse incrementally.
That’s a massive amount of data to transform, and the more data you have the longer it takes. If you discover after a few weeks (or months) that your binding rules have to change, you have to re-map and transform the data all over again, slowing or even bringing your analytics efforts to a halt until the process is finished. So, not only is it more expensive and time-consuming up-front, but the costs mount over time because of all these cascading changes.
In a Late-Binding™ approach, you can move the data from the transactional (source) system to the data warehouse with minimal transformation. The data is available and ready to use, but it hasn’t been committed to any particular relationships, so you have a lot of flexibility in terms of what you can do with it. You can then transform and bind it only when and if you have an actual need for it – similar to the “just-in-time” approach in manufacturing.
For example, suppose you want to measure readmission rates for heart failure patients. You can build a data mart to support that set of analyses, binding and transforming only the data sets you need to perform those analyses. Put another way, rather than binding all the contents of an entire library, you can just transform and bind the biography section within it.
With a Late-Binding™ approach, you spend far less time up-front (since you’re merely moving data to source marts to make it ready rather than actually binding it), and over time your costs are reduced because you’re only putting effort into binding data you actually need. When you do need to move the data from the source mart to the individual subject area data marts, you’re doing it within the boundaries of the EDW over which you have control, which saves additional time while giving you one less headache to manage.
Something else a Late-Binding™ approach brings is flexibility. Because it is built on top of a relational database, you can use many different commercially available tools to access it. Since you’re binding as late as possible, you also have greater flexibility if the data source or the rules change. The downstream changes are minimal and easily managed.
The net result is a faster time to value and a lower total cost of ownership.
How a Late-Binding™ Approach Works with Population Health Management
Here’s why the Late-Binding™ approach is so critical to healthcare organizations. Suppose you want to perform population health management for a cohort of patients who have diabetes and want to build a diabetes patient registry or data mart. For these patients you want to know their hemoglobin A1c (current and historical), their lipid profile (LDL, HDL and triglyceride), foot and eye exam history, the claims that have been submitted for these patients, their medication history, their BMI, and the history of their office visits.
With a star schema/early-binding model built for counting and aggregation rules related to a fixed set of facts, it would be difficult to support all the variables and the parent/child relationships for that population. The more you want to know how these facts relate to other facts, and those facts relate to a third set of facts, and what the attributes are around each of those facts, etc., the more complex it gets.
With a Late-Binding™ approach, you have the flexibility to include, exclude, and change the rules regarding these relationships as needed. As often as needed. Quickly and easily.
Healthcare Data Warehouse: The Right Construct
A star schema or other early-binding data warehouse makes sense and works well in many industries. Healthcare, however, isn’t one of them.
A Late-Binding™ Data Warehouse offers the flexibility to mine the vast number of variables and relationships in healthcare data effectively and leave room for the inevitable future changes. It saves time, lowers costs, and, most importantly, delivers the results that will help you mine your data in a way that improves patient outcomes.
Have you used a star schema approach in another industry? What worked well? What didn’t?