With healthcare delivery business models actively changing throughout this decade, our industry has witnessed a flood of digital technology startups working to solve new problems that arise with these changes. Traditional investors and health systems alike have welcomed this interest: the funding of digital healthcare companies has quadrupled in the past six years. However, a number of these digital healthcare companies are failing due to a lack of access to healthcare data.
While it’s true that there is a wealth of data available within healthcare organizations, there is not a single, concise data model that supports use cases in a sustainable way. Because of this limited access to data, some of the greatest ideas for solving today’s healthcare problems may never come to fruition. The industry’s history is littered with failed attempts to model healthcare data, from comprehensive enterprise data models that try to model every healthcare concept that might ever be encountered, to home-grown data models driven by ad-hoc requests for data points in which the value of the information added to the model is not assessed.
We live in a time of great opportunity to leverage new technology to improve healthcare, but realizing this opportunity requires concise, purpose-driven data models that also adapt to each organization’s specific needs.
At Health Catalyst, we are on our third iteration of how we approach data. We are two years into our journey of building and deploying the data models that support the Health Catalyst® Data Operating System (DOS™). Based on this extensive experience—successes and failures—here are the five pragmatic lessons we’ve learned about building adaptive healthcare data models:
Health Catalyst built concise yet adaptive data models using the most widely employed data elements from the products and analyses we’ve developed with our health system clients since our founding in 2008. Following the principles of late binding, we employ several strategies for including only the most relevant content in our data model:
It’s important to focus on the most relevant content and specific use cases: what people are looking for on a regular basis versus all the data points users may ever need for data analysis.
While we have not found an industry-defined data model that’s pragmatic to use in business-critical analytics and outcomes improvement, great research and experiential knowledge have gone into several data models, such as HL7’s FHIR model. We’ve analyzed and compared our data models to FHIR specifically, as there is an increasing convergence towards FHIR for data exchange. We also benefit from alignment with FHIR as we move to add FHIR-compliant APIs to DOS. We found a great deal of overlap, yet diverged from it in a few areas where FHIR wasn’t well-suited for our analytics use cases.
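To make the comparison with FHIR concrete, here is a minimal sketch of how a nested FHIR Patient resource might be flattened into a single analytics-friendly row. The input field names (`resourceType`, `id`, `name`, `family`, `given`, `gender`, `birthDate`) come from the FHIR R4 Patient resource. The output column names and the choice to keep only the first listed name are assumptions made for illustration, not the DOS schema.

```python
def flatten_patient(resource):
    """Flatten a FHIR R4 Patient resource (as a dict) into one flat row.

    The output column names are hypothetical, chosen only for this example.
    """
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a FHIR Patient resource")

    # FHIR allows several names per patient; this sketch keeps only the first.
    names = resource.get("name") or [{}]
    first_name_entry = names[0]

    return {
        "patient_id": resource.get("id"),
        "family_name": first_name_entry.get("family"),
        # 'given' is a list of strings in FHIR R4; join it for a flat column.
        "given_name": " ".join(first_name_entry.get("given", [])) or None,
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }


# Made-up example resource, not real patient data.
example_patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "name": [{"family": "Rivera", "given": ["Ana", "Maria"]}],
    "gender": "female",
    "birthDate": "1984-07-02",
}

row = flatten_patient(example_patient)
print(row)
```

The nesting that serves data exchange well (repeating names, identifiers, and contact points) is often what an analytics model chooses to flatten. This is one plausible example of where a model can overlap with FHIR in content while diverging in shape.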
No existing data model is perfect, but we can learn from what exists and leverage the great work that’s already been done in healthcare data modeling.
I’ve worked at three data-driven healthcare companies over the past two decades. I’ve trained hundreds of data analysts from dozens of healthcare organizations on a handful of different data models, and simple entity-relationship diagrams (ERDs) are the most effective tools I’ve found for quickly bringing a data analyst up to speed. In the past, my team compiled this documentation into a binder that we published, which clients and team members referred to as their “bible.”
Even experienced data analysts need a guide to the tables that exist and how they relate to each other before they can get started. Analysts also want to know where data comes from, so the system we built self-documents the data pipeline and shows analysts the source of each data element. Once data analysts understand the high-level data model and the data’s origins, they can dive in and make use of the data. More detailed questions about the data model will inevitably come up: “Which date/times are used to calculate length of stay (LOS)?” and “What are the exception criteria (e.g., transfers)?” This is where a detailed data dictionary comes in, with definitions of each table and each column, including any nuances of what data should be populated within it.
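As a rough illustration of what a machine-readable data dictionary entry could look like, the sketch below stores a definition and its nuances alongside the calculation it documents. Everything here is hypothetical: the table and column names, the wording of the definition, the transfer rule, and the choice to compute LOS as discharge minus admit in fractional days. A real dictionary would record whatever rules the organization has actually agreed on.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ColumnDefinition:
    """One data dictionary entry: what a column means and its known nuances."""
    table: str
    column: str
    definition: str
    nuances: str = ""


# Hypothetical entry; the names and rules are assumptions for illustration.
DATA_DICTIONARY = {
    ("Encounter", "LengthOfStayDays"): ColumnDefinition(
        table="Encounter",
        column="LengthOfStayDays",
        definition="Days between AdmitDTS and DischargeDTS.",
        nuances="Transfers within the facility do not restart the clock.",
    ),
}


def describe(table, column):
    """Return a one-line answer to 'what does this column mean?'."""
    entry = DATA_DICTIONARY[(table, column)]
    return f"{entry.table}.{entry.column}: {entry.definition} {entry.nuances}".strip()


def length_of_stay_days(admit, discharge):
    """Compute LOS as documented in the hypothetical entry above."""
    return (discharge - admit).total_seconds() / 86400


print(describe("Encounter", "LengthOfStayDays"))
los = length_of_stay_days(datetime(2019, 1, 1, 8, 0), datetime(2019, 1, 3, 20, 0))
print(los)
```

Keeping the definition next to the calculation means an analyst who asks which date/times feed LOS gets the same answer from the documentation and from the code.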
Data model documentation should both enable its users to quickly come up to speed on the model and serve as a reference manual when detailed questions come up.
At Health Catalyst, each data model has a product owner responsible for the long-term success of the product. Each data model’s owner and team are embedded in the groups that use the data model; they are experts in their data model’s domain. Product owners work with deployment teams to understand what’s working and what’s not, build product feature backlogs and roadmaps, and manage regular releases of the data model. These teams treat data models like products by leveraging effective development tools, such as source control and integrated development environments (IDEs).
An example of long-term planning comes from our recent evaluation of one of Health Catalyst’s larger data models, in which we discovered unused fields and duplicative fields that were causing slower data refreshes each night. We successfully trimmed the data model’s size by 24 percent, resulting in faster load times.
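A minimal sketch of how such a review might be partly automated is shown below. It is not the process Health Catalyst used. It assumes "unused" can be approximated by a column that is empty in every row and "duplicative" by two columns holding identical values in every row, and the table and column names are made up for the example.

```python
def find_unused_columns(rows, columns):
    """Columns that are empty (None or "") in every row."""
    return [c for c in columns if all(r.get(c) in (None, "") for r in rows)]


def find_duplicate_columns(rows, columns):
    """Pairs of columns whose values match in every row.

    Returns (kept_column, duplicate_column) tuples in column order.
    """
    seen = {}
    duplicates = []
    for c in columns:
        signature = tuple(r.get(c) for r in rows)
        if signature in seen:
            duplicates.append((seen[signature], c))
        else:
            seen[signature] = c
    return duplicates


# Hypothetical encounter rows, not real data.
encounters = [
    {"encounter_id": 1, "admit_dts": "2019-01-01", "admission_dts": "2019-01-01", "legacy_code": None},
    {"encounter_id": 2, "admit_dts": "2019-01-03", "admission_dts": "2019-01-03", "legacy_code": None},
    {"encounter_id": 3, "admit_dts": "2019-01-07", "admission_dts": "2019-01-07", "legacy_code": ""},
]
all_columns = ["encounter_id", "admit_dts", "admission_dts", "legacy_code"]

unused = find_unused_columns(encounters, all_columns)
# Skip unused columns so that several empty columns are not reported as duplicates.
populated = [c for c in all_columns if c not in unused]
duplicated = find_duplicate_columns(encounters, populated)

print("Unused:", unused)
print("Duplicative:", duplicated)
```

Results like these are candidates for review, not automatic deletions. A column can be empty in a sample yet required downstream, and two columns can match by coincidence, so a product owner would still confirm each finding before trimming the model.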
This focus on the long-term success of the data model, versus only fixing acute issues as they arise, is key to ensuring it stays healthy and relevant.
In nearly every analytics project that I have participated in, testing data quality has been a very manual endeavor that’s done once, up front. For example, when a data warehouse goes in, experienced data analysts are asked to evaluate the quality of the data. Expert analysts know which areas are most prone to issues, such as duplicates within a referring provider table and their downstream effects on data quality, and they aren’t fooled by false positives, like “missing” orders for reflex lab results.
While an expert review is necessary, it is labor-intensive, and much of the work involves looking for well-known patterns. We strongly advocate using automated data profiling tools to support experts in data quality reviews. Automated data profiling can identify the most common issues and patterns that affect data quality. For example, our automated profiling tools look at column fill rates and value distributions within the data. This information helps expert analysts quickly identify problems, such as a 20 percent fill rate on the primary diagnosis field for inpatient visits. Profiling output still requires expert interpretation: an analyst at a pediatric cancer center, for example, can use expert judgment to interpret a 19:1 annual ratio of prescription fills per patient as normal, whereas the same ratio would be a red flag for data quality at a health system with a healthier patient population.
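The two profiling measures named above, column fill rate and value distribution, can be sketched in a few lines. This is an illustration only, not Health Catalyst's profiling tool. The sample rows, the column names, and the 0.9 threshold are assumptions, and a real threshold would be set per column by an analyst who knows the data.

```python
from collections import Counter


def profile_column(rows, column):
    """Return the fill rate and value distribution for one column.

    rows is a list of dicts; None and "" both count as unfilled.
    """
    total = len(rows)
    filled = [r[column] for r in rows if r.get(column) not in (None, "")]
    return {
        "column": column,
        "fill_rate": len(filled) / total if total else 0.0,
        "distribution": Counter(filled),
    }


def flag_low_fill(rows, columns, threshold):
    """List the columns whose fill rate falls below the given threshold."""
    return [c for c in columns if profile_column(rows, c)["fill_rate"] < threshold]


# Hypothetical inpatient visits: only one of five has a primary diagnosis.
visits = [
    {"visit_id": 1, "visit_type": "inpatient", "primary_dx": "I50.9"},
    {"visit_id": 2, "visit_type": "inpatient", "primary_dx": None},
    {"visit_id": 3, "visit_type": "inpatient", "primary_dx": ""},
    {"visit_id": 4, "visit_type": "inpatient", "primary_dx": None},
    {"visit_id": 5, "visit_type": "inpatient", "primary_dx": None},
]

report = profile_column(visits, "primary_dx")
low_fill = flag_low_fill(visits, ["visit_id", "visit_type", "primary_dx"], threshold=0.9)

print(f"{report['column']}: fill rate {report['fill_rate']:.0%}")
print("Columns below threshold:", low_fill)
```

The code only surfaces the pattern. Deciding whether a low fill rate or an unusual distribution is a true data quality problem remains the expert's call.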
Pair automated data profiling with an expert review to ensure the highest data quality. Validate data quality upon initial system installation and proactively monitor through automated data profiling against production systems.
Healthcare has made incredible advancements in developing healthcare data models—models that underlie the innovative technologies designed to solve healthcare’s problems. The five lessons outlined in this article come from Health Catalyst’s experience as we continuously evolve our standard data models, but they also reflect an industrywide data model evolution.
Healthcare continues to build and invest in innovative digital solutions. As a result, the need for concise yet adaptive healthcare data models that support this innovation, along with the methodologies, tools, and best practices that go into building them, is stronger than ever.