The Late-Binding™ Data Warehouse: A Detailed Technical Overview
months or longer. When new data sources are added to the data warehouse (as occurs in mergers, acquisitions, and ACO partnerships), this lengthy time-to-value is repeated again and again. Likewise, as the analytic use cases in an organization inevitably grow more complex, the early binding data model must be modified and the source system data must be conformed and mapped again. These early binding data models cannot keep pace with changes in the analytic environment, and the data warehouse subsequently fails to deliver on its initial appeal.
File structure association, popularized first by IBM mainframes some 60 years ago, is reappearing in the form of Hadoop, MapReduce, PIG, and NoSQL. Data warehouses based upon this technology exhibit the lowest degree of data binding and coupling. In fact, since there is no data model in this methodology, there is no data binding until the binding is declared between the files of data through associative programming. Data warehouses based upon this technology are very adept at quickly loading data into the warehouse, but the benefit of time gained in loading is more than negated by the complexity of the programming that is required to minimally bind and declare associations between the files and data.
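To make the "associative programming" burden concrete, here is a minimal sketch in Python. It assumes two hypothetical raw extracts (the file contents, column names, and the `associate` helper are all invented for illustration) and shows that, with no data model, the relationship between files exists only in the code that declares it.

```python
import csv
import io

# Hypothetical raw extracts, loaded as-is with no shared data model.
PATIENTS_FILE = """mrn,name
1001,Ann Lee
1002,Raj Patel
"""

LABS_FILE = """pat_mrn,test,value
1001,glucose,98
1001,hba1c,5.6
1002,glucose,143
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def associate(left_rows, left_key, right_rows, right_key):
    """Declare the binding in code: pair rows whose key columns match."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


# The association between the two files lives only in this call.
labs_by_patient = associate(
    read_rows(PATIENTS_FILE), "mrn",
    read_rows(LABS_FILE), "pat_mrn",
)
```

Every new question that relates two files needs another hand-written association like this one, which is the programming cost the paragraph above describes.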
The Late-Binding Data Warehouse strikes a balance between the early binding extreme of Inmon, Kimball, and I2B2 and the no-binding environment of Hadoop. The Late-Binding Data Warehouse emphasizes the following fundamental principles related to data modeling:
1. The key to success for data warehouses is relating data, not modeling data. When in doubt, model less, not more.
2. Minimize the use of new conformed data models in the data warehouse by instead leveraging the data models used in the source systems.
3. Apply data models to subsets of data—in data marts—when binding formerly disparate data in new contexts to support new analytic use cases.
4. Approximately 20 core data elements are fundamental to almost all analytic use cases in the healthcare industry. Early binding to these core data elements is a best practice. Bind to other terms and vocabularies later and only when required by analytic use cases.
The core data elements are shown below, illustrating how those data elements leverage the data models of the source systems to act as a data bus for Health Catalyst’s Late-Binding Data Warehouse platform. This approach allows queries across disparate source system content in the data warehouse, delivering the theoretical benefits of an enterprise data model without requiring development of, and conformance to, such a model. The diagram also illustrates how the Health Catalyst analytics platform can easily feed data to non-Health Catalyst applications.
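As a rough illustration of the data-bus idea, the sketch below uses Python's built-in `sqlite3` module. The table names, column names, and the choice of `patient_id` and `encounter_id` as core data elements are assumptions made for this example, not the platform's actual schema. Each table keeps its source system's layout; only the shared core elements are bound early, and they are enough to query across both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical EHR source mart: keeps the source system's own layout.
    CREATE TABLE ehr_encounter (
        patient_id     TEXT,  -- assumed core data element, bound early
        encounter_id   TEXT,  -- assumed core data element, bound early
        admit_dx_local TEXT   -- source-specific column, left unbound
    );
    -- Hypothetical billing source mart: also keeps its own layout.
    CREATE TABLE billing_charge (
        patient_id    TEXT,
        encounter_id  TEXT,
        charge_amount REAL
    );
    INSERT INTO ehr_encounter VALUES
        ('P1', 'E1', 'dx-a'), ('P2', 'E2', 'dx-b');
    INSERT INTO billing_charge VALUES
        ('P1', 'E1', 120.0), ('P1', 'E1', 80.0), ('P2', 'E2', 50.0);
""")


def charges_per_encounter(connection):
    """Join two source marts using only the shared core data elements."""
    return connection.execute("""
        SELECT e.patient_id, e.encounter_id, e.admit_dx_local,
               SUM(c.charge_amount)
        FROM ehr_encounter AS e
        JOIN billing_charge AS c
          ON c.patient_id = e.patient_id
         AND c.encounter_id = e.encounter_id
        GROUP BY e.patient_id, e.encounter_id, e.admit_dx_local
        ORDER BY e.encounter_id
    """).fetchall()
```

No enterprise-wide model is built here: the join relies only on the two shared keys, and `admit_dx_local` passes through in its source form.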
Late Binding to Other Vocabulary Terms and Rules
As organizations progress in analytic maturity and sophistication, the need to bind to new and more complex vocabularies and rules will follow. By focusing first on the core data elements and then binding to additional rules and vocabulary only when a clear analytic use case requires it, data engineers can deliver rapid time-to-value both initially and later, when new analytic use cases arise. Some of the additional vocabularies that are typically appropriate for later binding in healthcare include LOINC, RxNorm, SNOMED, and HCPCS. Business and clinical rules about data are even more complex and volatile.
Health Catalyst’s Analytic Adoption Model, below, illustrates the relationship between progressively higher levels of analytic capability and the need to bind to more complex rules and vocabularies. The important concept to reemphasize is not to bind data to rules or vocabularies until the analytic use case requires it. Too often, data warehouse projects in healthcare attempt to bind data to rules and vocabularies in anticipation of functioning at Level 8 of this model when the organization is still operating at Level 0. It takes years to progress to Level 8, and during that time, rules and vocabularies in healthcare will undoubtedly change. As the old saying goes, don’t drive beyond your headlights. It is a dangerous waste of resources and time to bind to rules and vocabularies that are far beyond the current analytic use cases of the organization.
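One way to picture late binding to a vocabulary is the Python sketch below. It is only an illustration: the local lab codes, the `STD-GLUCOSE` placeholder (standing in for a real standard code such as a LOINC identifier), and the `bind_at_query_time` helper are all hypothetical. The raw results are stored with their source codes, and the mapping is applied at query time for the one use case that needs it.

```python
# Raw lab results stored exactly as the source systems coded them.
RAW_LAB_RESULTS = [
    {"patient_id": "P1", "local_code": "GLU", "value": 98.0},
    {"patient_id": "P2", "local_code": "GLUC-SER", "value": 143.0},
    {"patient_id": "P1", "local_code": "K", "value": 4.1},
]

# Mapping written only once a glucose use case exists. The target is a
# placeholder, not a real standard vocabulary code.
LOCAL_TO_STANDARD = {
    "GLU": "STD-GLUCOSE",
    "GLUC-SER": "STD-GLUCOSE",
}


def bind_at_query_time(results, mapping):
    """Return copies of the mapped rows with a standard code attached.

    Rows whose local code has no mapping yet are skipped, and the stored
    raw rows are never modified.
    """
    bound = []
    for row in results:
        standard = mapping.get(row["local_code"])
        if standard is not None:
            bound.append({**row, "standard_code": standard})
    return bound


glucose_rows = bind_at_query_time(RAW_LAB_RESULTS, LOCAL_TO_STANDARD)
```

Because the stored rows are untouched, a later change to the vocabulary means editing the mapping rather than reloading the warehouse. The unmapped potassium row stays in the raw data until a use case calls for it.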
Summary of Principles in the Health Catalyst Late-Binding Data Warehouse
Below is a summary of the principles that underlie the Health Catalyst approach to analytics. These principles have enabled data warehouses in the military, manufacturing, and healthcare that have been operating and adapting for over 20 years, with an unparalleled track record of proven results.
1. Minimize remodeling data in the data warehouse until the analytic use case requires it. Leverage the natural data models of the source systems by reflecting much of the same data modeling in the data warehouse.
2. Delay binding to rules and vocabulary as long as possible until a clear analytic use case requires it.
3. Earlier binding is appropriate for business rules or vocabularies that change infrequently or that the organization wants to lock down for consistent analytics.
4. Late binding in the visualization layer is appropriate for what-if scenario analysis.
5. Retain a record of the changes to vocabulary and rule bindings in the data models of the data warehouse. This will provide a self-contained configuration control history that can be invaluable for conducting retrospective analysis that feeds forecasting and predictive analytics.
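The last principle, retaining a record of binding changes, could be sketched as an effective-dated history. The Python example below is a minimal illustration under stated assumptions: the rule name, its values, the dates, and the `binding_as_of` helper are hypothetical, not taken from any Health Catalyst product.

```python
from datetime import date

# Hypothetical history: each entry records the value a rule was bound to
# and the date that binding took effect. Old entries are never deleted.
BINDING_HISTORY = [
    {"rule": "example_threshold", "value": 30,
     "effective_from": date(2012, 1, 1)},
    {"rule": "example_threshold", "value": 60,
     "effective_from": date(2014, 7, 1)},
]


def binding_as_of(history, rule, as_of):
    """Return the value in force for `rule` on `as_of`.

    Returns None when no binding for the rule had taken effect yet.
    """
    candidates = [
        entry for entry in history
        if entry["rule"] == rule and entry["effective_from"] <= as_of
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda entry: entry["effective_from"])
    return latest["value"]
```

With a history like this, a retrospective analysis can evaluate past data under the rule that applied at the time, for example `binding_as_of(BINDING_HISTORY, "example_threshold", date(2013, 6, 1))`, rather than under today's definition.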