Late Binding - The New Standard for Data Warehousing Part 1 (Webinar)


Late Binding – The New Standard for Data Warehousing (Transcript)

Those of you that have been around long enough will remember the old days of software engineering when we would write what we call “very tightly coupled code”, hundreds of thousands of lines of code sometimes in one module. And we would bind all of that software together at compile time, thousands of function points all together. Sometimes those function points really had nothing to do with one another until they were needed. But nevertheless, we would bind all of that code together. And again, for those of you that remember, if one thing broke, everything broke in that software. It created a lot of problems for us to maintain, it was hard to write, and it was hard to test. It took a long time to produce functionality because you had to deliver everything at once. And then there was very little flexibility or agility after we deployed the software. To a large degree, this is one of the reasons that we have such a hard time upgrading electronic medical records today. As a CIO, I’ve managed Cerner and EMR upgrades, and one of the reasons those systems are so difficult to upgrade is because they’re still based around these old-fashioned software engineering principles that tightly couple lots of code together. And so, it makes it very challenging to upgrade those systems, and as those of you who have been involved know, it’s a months-long process to do that.

Origins of Late Binding

1980s: Object Oriented Programming
The origins of late binding came along in the 1980s when object-oriented programming became more commonly practiced and understood. Dr. Alan Kay, who was at the University of Colorado and the University of Utah and also Xerox PARC, wrote some seminal papers and books on the topic. And the concept that he advocated was that we start writing our code in smaller objects and that these objects should be named in ways reflective of the real world in which they were being written. So we would label those objects in terms that were familiar to the business that we were supporting. And then we would compile those individually and we would link them at runtime rather than at compile time, only when they were needed for interaction. And this led to some significant improvements in agility and adaptability to address new use cases.

Steve Jobs

One of the things that I think Steve Jobs doesn’t get credit for is the commercialization of these concepts that Alan Kay originally wrote about. So when Steve Jobs went to NeXT computing, of course being the kind of person that he was, he looked around at what was the current state of the art for software engineering. And even though Steve Jobs was not a programmer himself, he could see the elegance of what Alan Kay was advocating. And so that became a standard for software engineering at NeXT computing, and late binding then became the norm in Silicon Valley. And I really think that even though Steve Jobs gets credit for Apple computing, what he did at NeXT around software engineering – everything that we see now with Google and Facebook and Amazon, all of the agility and the adaptability to support tens of millions of clients with very complex applications – is all attributable back to Alan Kay and Steve Jobs.

Applying Binding Concepts to Data

So a few years ago, I was contemplating my successes and failures in data warehousing. And also, as a consequence of the Healthcare Data Warehousing Association that we started, I had the opportunity to look across a lot of organizations and the way that they were operating and whether they were succeeding or failing. And the patterns that I saw in those successes and failures mirrored the patterns that I learned from in software engineering. And it occurred to me that the failures were associated with early binding and that the more successful implementations of data warehouses were a reflection of late binding.

And so I started developing this concept a little bit further in my head and I realized that atomic data can be “bound” to business rules about that data and to vocabularies related to that data. So in healthcare, some of the vocabulary binding is pretty obvious. It includes things like unique patient and provider identifiers; standard facility, department, and revenue center codes; standard definitions for gender, race, and ethnicity; and of course ICD, CPT, SNOMED, LOINC, RxNorm, RADLEX, etc. Those are all vocabularies that we can bind data to.

Even more complicated are the business and clinical rules that we bind to data, in algorithms and things such as that. So calculating length of stay is an example; attributing a patient to a provider and defining that relationship for accountable care; allocating expense or revenue to departments and physicians; and data definitions of general disease states and patient registries, what I call data definitions of numerators. One of the unique features of analytics in healthcare is this complex manipulation of numerators, and that’s one of the important aspects of late binding: in healthcare, complex manipulation of numerators of patient types is one of the most difficult things that we face, and it’s one of the most important reasons we have to maintain analytic agility. Finally, there are things like exclusion criteria for patients that we want to exclude or re-categorize in disease and population management, and then rules around admission, discharge, and transfer.
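To make that concrete, here is a minimal Python sketch, purely illustrative and not any vendor’s implementation, of what it means to keep a business rule like length of stay late bound: the rule lives as a swappable definition that is applied to the atomic encounter data at analysis time, rather than being burned into the data at load time. The field names and the two competing definitions here are hypothetical.

    from datetime import datetime

    # Atomic encounter data, landed as-is from the source system (hypothetical fields).
    encounters = [
        {"encounter_id": 1, "admit": datetime(2014, 3, 1, 23, 50), "discharge": datetime(2014, 3, 2, 0, 30)},
        {"encounter_id": 2, "admit": datetime(2014, 3, 1, 8, 0),  "discharge": datetime(2014, 3, 4, 10, 0)},
    ]

    # Two competing length-of-stay definitions, kept as late-bound rules rather than
    # materialized columns. Swapping the rule does not require reloading any data.
    def los_calendar_days(enc):
        """Count calendar dates touched by the stay (crossing midnight counts as a day)."""
        return (enc["discharge"].date() - enc["admit"].date()).days

    def los_elapsed_days(enc):
        """Count full elapsed 24-hour periods."""
        return (enc["discharge"] - enc["admit"]).days

    # Bind the rule only at analysis time.
    for rule in (los_calendar_days, los_elapsed_days):
        print(rule.__name__, [rule(e) for e in encounters])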

Six Points to Bind Data in an Analytic System

And after further reflection on this concept, I realized that there really are six places that we could bind rules and vocabulary to data in the flow of data through a data warehouse. I’m sure many of you have seen diagrams similar to this, in which the source system content is on the left and that flows into the data warehouse on the right. The binding points toward the left are where you can bind to vocabulary and business rules if they are low volatility.

So one of the examples I like to give in that area is the patient identifier. If you’re lucky enough to work in a healthcare organization that has a unique patient identifier across all of your healthcare delivery organizations, then it makes a lot of sense to accept that binding, accept that unique patient identifier, and pull that into the data warehouse. And in so doing, everything downstream of that binding then takes advantage of that unique identifier. So all of the data downstream of that binding benefits from that, and it’s probably one of the lowest volatility vocabulary terms in healthcare. In theory, patient identifiers, once assigned, shouldn’t change in the life of that patient.

On the other hand, for high-volatility vocabulary or business rules, those that change a lot, you should withhold that binding until layers 5 or 6 of this model. And to give an example of that, it’s not unusual for organizations to have a lot of debate about how to define a particular disease state. Diabetes seems to be a fairly commonly debated item. You would think that we would have standard definitions of these disease states by now but we don’t. And if you go from organization to organization, there’s a lot of debate about how to define a diabetic patient from a data perspective.

So what I advocate is that you loosely bind that definition of a diabetic patient, usually in one of the disease registries here, or maybe not even at all initially, and you let end users explore the various definitions that they want to entertain as a diabetic patient definition. Let them explore, let them decide as an organization how they want to fingerprint that definition. And then once there is some agreement upon that data definition, then bind it at layer 5 in one of your disease registries.
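As an illustration of that exploration, here is a small Python sketch, with made-up data and illustrative criteria, of how several candidate definitions of a diabetic patient can stay loosely bound as interchangeable rules while end users compare them, and how the agreed-upon definition can then be materialized into a registry at layer 5.

    # Hypothetical atomic data pulled straight from source systems.
    patients = {
        "P1": {"dx_codes": {"250.00"}, "hba1c": 7.9, "meds": {"metformin"}},
        "P2": {"dx_codes": {"401.9"},  "hba1c": 6.1, "meds": set()},
        "P3": {"dx_codes": set(),      "hba1c": 6.8, "meds": {"insulin glargine"}},
    }

    # Candidate definitions the organization might debate. None of them is "the"
    # definition; they stay loosely bound while end users explore the differences.
    definitions = {
        "dx_only":       lambda p: any(c.startswith("250") for c in p["dx_codes"]),
        "dx_or_lab":     lambda p: any(c.startswith("250") for c in p["dx_codes"]) or p["hba1c"] >= 6.5,
        "dx_lab_or_med": lambda p: any(c.startswith("250") for c in p["dx_codes"])
                                   or p["hba1c"] >= 6.5
                                   or bool(p["meds"] & {"metformin", "insulin glargine"}),
    }

    # "What if" comparison: how many patients qualify under each candidate definition?
    for name, rule in definitions.items():
        cohort = [pid for pid, p in patients.items() if rule(p)]
        print(f"{name}: {cohort}")

    # Once the organization agrees on one definition, that single rule is bound at
    # layer 5 by materializing it into a diabetes registry.
    agreed = definitions["dx_or_lab"]
    diabetes_registry = {pid for pid, p in patients.items() if agreed(p)}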

So again, going back to this diagram for just a minute. If the vocabulary or the business rule is consistently low volatility, then binding at points 1, 2, 3 is usually a very safe practice. If it’s a more volatile vocabulary, if it changes quite often, or if the business rule or clinical rule is more volatile, then you want to wait until points 5 or 6 to bind those. And this is an interesting issue to highlight when you hire and you train and you mentor data engineers and data architects. In addition to the technical understanding that they have with SQL and databases and things like that, they need to have an understanding of the business environment and the clinical environment that they are supporting, so that they can understand and predict the volatility of the vocabulary or business rule that they are implementing.

And so, that’s one of the things that’s a little different about the teams that I have managed operationally and the people that we hire at Health Catalyst. We hire for and mentor great technical skills, but we also feel it’s really important that our data engineers and data architects understand the volatility of the business rules and the volatility of the vocabulary that they’re dealing with, because they are going to have to make a decision about where to bind that data. And if you bind too early, everything downstream of that binding is now beholden to that definition, so you’ve inherently limited your flexibility downstream with that binding. And again, in some cases, you want to limit that flexibility, but in many, many cases in healthcare, especially in today’s environment where there’s so much volatility, you want to allow for as much agility and adaptability as possible, which means binding later, in layers 3, 4 and 5 and sometimes in layer 6 too.

Data Modeling for Analytics

So as it turns out, there are really about 5 basic approaches to data modeling for analytics, and I’ve listed these in a progression from early to late binding.
1) There’s what’s known as the Enterprise Information Model or a Corporate Information Model and it’s typically advocated by Bill Inmon and Claudia Imhoff. It’s quite often associated with Oracle, for example, which has a healthcare data model that’s been around for quite a while. IBM has a healthcare data model that’s been around for a while.
2) I would say that I2B2 is kind of a flavor of a Star Schema, but it’s advocated primarily by academic medicine.
3) The Star Schema is a dimensional model advocated by Ralph Kimball. Star schemas have been around for a number of years.
4) There’s the Late Binding Bus Architecture, which I’m a strong advocate of.
5) And then finally there’s what I call File Structure Association, which was originally common on IBM mainframes in the ’60s, when relational database engines didn’t even exist yet. And now it’s interesting to see that reappear in Hadoop & NoSQL, where there is no data model, and that is the latest and loosest binding possible. And we’ll talk a little bit more about Hadoop & NoSQL later, but right now there’s not a huge and compelling reason to use Hadoop & NoSQL in healthcare.

If you look at the origins of Hadoop in particular, they were really driven by scalability. Google and Yahoo faced scalability challenges that no organization had ever faced before, not even in the military. And so, they couldn’t depend on the traditional relational databases to satisfy that scalability. There was just too much overhead. So the good news is that Hadoop is enormously scalable and all the tools around it are very powerful, but of course it’s very complicated as well. But it is the latest of the late binding approaches.

Early vs. Late Binding in Data Warehouses

So let’s talk a little bit more about early versus late binding in data warehouses. And a criticism that I’ve had for many, many years, and those of you who know me know this is nothing new, is that the downfall of these enterprise data models, whether they be Inmon or Kimball, is that they are inherently “early binding” in their design. So if you go back to that previous diagram, they are binding at points 2 and 3 of the flow of data consistently. And they map to standard vocabularies early. A lot of times they map in ETL from the source systems or immediately after. It’s not always a bad practice. But those of you who have been around data modeling long enough know that when you try to map data together from disparate source systems into the new model, you will inherently have to make compromises about the way that data is represented. It’s going to change the way it’s represented from the source systems into this new model. But unfortunately, those rules that you bind in that data model are very volatile and they vary from one organization to another, and they change often as a result of changes in the industry.

So the net effect here is that it takes a long time to map data into those enterprise data models and I’m sure those of you who have done this will attest to that. It’s a very difficult process. There’s a lot of debate. It takes a lot of time to do it. And then after you’ve managed to get through that mapping, there is also very poor agility that follows. So it’s a very lengthy time to introduce new data into the data model. So for instance, let’s say that you go through this practice with your organization and then you acquire a new hospital or you acquire a physician group and they have a new source system that you need to incorporate into your data warehouse. You have to go through that same lengthy mapping of source system data into this new data model. So there’s an inherent delay in time-to-value right there. But then there will also be additional compromises that you have to make in that data model about the way to represent that source system data from the hospital or the physician group that was part of the acquisition.

So one of the things, before I go to that slide, just let me enter a thought here, and that is the key to success in analytics and data warehousing is not modeling data, and that’s a hard thing for data modelers and data architects to let go of. And I’ve been there. I know how hard it is, and it took me many years and some very significant failures before I realized that data modeling was holding me back. What you want to do, and this is one of the beauties of Hadoop, is you want to relate data. You’re not trying to model data, you’re trying to relate data. So the beauty of tools like Hadoop is that they are almost infinitely flexible in the way they can relate data. Now again, as I mentioned, Hadoop is not necessarily appropriate for healthcare right now, but it does give an example of how powerful late binding and no data modeling can be, because data models in analytics, generally speaking, tend to be restrictive rather than enabling agility.

A Small Difference in Complexity

So, I always like to use this as an example of the requirements spec for retail analytics. So if you look at where Corporate Information Models and Star Schemas originated, they really originated in the retail industry. And Walmart is well known for the success and the amazing capabilities of their data warehouse. But the reality is most of the requirements for the Walmart data warehouse can be described in a sales receipt and that’s the reality. Now it’s a little more complicated than that obviously to run Walmart but a large part of what they’re trying to understand is what’s contained in a sales receipt.

A Portion of the Requirements Spec for Healthcare Analytics

In contrast to that, this is a requirements spec for healthcare analytics. This is a screenshot from an electronic medical record. And it’s only a portion of what we have to deal with. It’s enormously more complicated. It’s enormously more fluid. And so, you must retain analytic agility and adaptability if you’re going to be successful with analytics in healthcare, and it’s a completely different world from retail and manufacturing, and I’ve been there and I’m speaking from experience. Healthcare was a very humbling experience for me when I first came into it. I came in with a very successful career in the US Air Force and with the National Security Agency. I was the chief architect for Intel’s first EDW. So I had a pretty good track record and I was immediately humbled at the complexity of analytics in healthcare, and it’s taken me a long time and I’ve stubbed my toe a lot of times to finally figure out how to do this the right way.

Expanding Ecosystem of Healthcare Data

So going back to the complexity of what we deal with in healthcare, that medical record is only part of what we do, and our ecosystem of analytics in healthcare is continuously expanding. So these are the data sources and the types of data that we’re dealing with right now for the most part – billing data, lab, imaging, and then particular reports from images, inpatient EMRs, outpatient EMRs, health information exchange and claims. That pretty much constitutes our data ecosystem at the moment. But it’s expanding. We’re looking at home monitoring, external pharmacy data, bedside monitoring data. That’s also entering into our environment right now. And finally, patient reported outcomes, 24/7 biometrics, genomic data and long-term care data. Those are in our future. So what all this tells us is the need for a very adaptable analytics design. And the traditional way of modeling data in these big data models that bind early is going to be really problematic for organizations as we progress from left to right in the ecosystem of healthcare.

Your Data Acquisition Checklist

So this is what I call your data acquisition checklist, and I mean this quite literally. This is, from 1 to 16, the progression of data that you should have on your checklist for acquisition, not only as a transaction system supporting workflow but also as content in your data warehouse. So you should have billing data, lab data, imaging data, inpatient/outpatient EMR data, and claims data in your data warehouse now if possible. Claims data tends to be the most difficult right now for a lot of healthcare delivery organizations. In the next 1 to 2 years, you need to have a plan for HIE data, and I’ll talk about why these are highlighted in red, cost accounting data, bedside monitoring data, external pharmacy data, familial relationship data, and home monitoring data. And then in the next 2 to 4 years, we have to include patient reported outcomes, long-term care facility data, genomics, and real-time 24/7 biometric data for all patients in the ACO, not just patients that are sick.

I highlight two of these data sources and data content areas in red because I think they represent some of the most significant gaps in our strategy for data acquisition in healthcare right now, and if we don’t get past these, we’re not going to be able to progress as far as we need to in the industry to improve quality and reduce cost as much as we should. For example, there really aren’t any vendor solutions right now, workflow solutions or vendor solutions, that provide detailed cost accounting data to healthcare. I would suggest that maybe 1% to 2% of the healthcare delivery organizations in the US right now have a handle on detailed cost accounting data. And so what that means is we don’t really know what it costs to provide healthcare. We’re guessing right now. And if you start taking on risk in an accountable care organization and you don’t know what your detailed costs are, you’re gonna be in a lot of trouble. Right? So we have to put pressure as customers on the entrepreneurs and the vendors in the industry to start developing detailed cost accounting systems for us. These systems need to capture data at the point of care delivery, so it’s what I call instrumentation of the workflow, so that we’re capturing data easily. We’re not imposing more clicks upon nurses and physicians but we’re using barcode readers, point of presence, real time location services, those kinds of things that collect data about our workflow and our detailed cost at the specific patient level. No more guesswork across big patient populations. That’s not good enough.

The other thing that’s really important, too, that I see no progress on in the vendor space is patient reported outcomes data. My Toyota maintenance guy down the street collects more outcomes data from me as a customer than my physician does. And until we start collecting better patient reported outcomes data that are tailored to the protocol that I was treated under, predictive analytics will largely be guesswork. We cannot accurately predict and monitor patients and their health status without better patient reported outcomes data.

So, we need to collectively as a voice, those of us who are interested in this data management world, start carrying this message to vendors and hopefully there are some entrepreneurs out there that are listening to this as well.

Binding to Analytic Relations

So one of the things that, again, I wanna emphasize here is that in data warehousing, the key is to relate data, not model data. And right now, there are about 20 data elements that constitute 80% to 90% of analytics. That’s the reality of the situation. We tend to fall in love with and we’re kind of enamored with the possibility of NLP and free text data analysis, and that’s gonna be an important part of what we do in the future. But right now, we can go a long way toward benefiting healthcare if we just focus on the analytics associated with these core data elements that tend to overlap virtually every source system in healthcare right now.

So the message here is don’t get too concerned about vocabulary and free text and that sort of thing right now. Focus on these very simple, relatively simple I would say, and achievable analytic vocabularies to help you drive your strategy in the next year to two years. We’ll get there and we’ll have the tools for free text and other more complicated things later. But focus on the basics first.

Health Catalyst’s Late Binding Bus Architecture

So at Health Catalyst, and by the way you could take the Health Catalyst name brand off of this and you could look back at the data warehouses that I have advised on and the teams that I’ve led at Northwestern and Intermountain, and replace Health Catalyst with Northwestern or Intermountain or any number of others, but this architecture applies way beyond what Health Catalyst does.

So there are source systems in our environment, we talked about that. We pull those source systems into what we call the Late Binding Bus Architecture. And again, that bus architecture is just the ability to relate across these source systems using these common terms and vocabularies. And those of you that have been in the details of databases know that all this really amounts to is adding a foreign key to these systems as you land them into the data warehouse. You add the foreign keys to these systems and you standardize the data types, you standardize the names, so that if you want to query across these systems on CPT code, it’s very easy to do, and you don’t have to remodel the data. We’re landing this data in the data warehouse with minimal restructuring and minimal remodeling. The data structures look almost exactly the same as the source systems, but we have added foreign keys so that we can now relate that data, without remodeling it, across all of these source systems. So what this allows you to do is land data from new sources, the “et cetera” over here on the diagram, in your data warehouse in a matter of days and weeks and immediately make it available for analysis. You don’t have to remodel it. So you’re cutting your time to value from what you know is typically months, if not years, for a new source system down to literally days and weeks. And then finally, you’re exposing this platform of data to Health Catalyst specific applications. So we build a number of applications that leverage this platform. We encourage our clients to develop their own applications and share those with other clients. There are third-party applications that we can interface with and supply data to, for instance, products like Crimson, Midas, that sort of thing, and then just basic old ad hoc query tools that have been around forever as well. Tools like QlikView, Business Objects, Microsoft Access, and even Excel.
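Here is a rough sketch of that idea using Python and SQLite; it is not the Health Catalyst schema, and the table and column names are invented, but it shows the mechanics: source tables are landed nearly as-is, a couple of standardized bus columns (a shared patient identifier and a consistently named CPT code) are added, and a query can then relate the two extracts without any remodeling.

    import sqlite3

    # In-memory database standing in for the EDW (a sketch, not a real warehouse).
    db = sqlite3.connect(":memory:")

    # Source tables are landed with minimal restructuring -- column names and shapes
    # stay close to the source systems -- but each gets standardized "bus" columns.
    db.executescript("""
    CREATE TABLE src_billing (
        claim_no      TEXT,
        svc_date      TEXT,
        charge_amt    REAL,
        patient_id    TEXT,   -- bus column (shared patient identifier)
        cpt_code      TEXT    -- bus column (standardized name and type)
    );
    CREATE TABLE src_ambulatory_emr (
        visit_key     INTEGER,
        provider      TEXT,
        patient_id    TEXT,   -- bus column
        cpt_code      TEXT    -- bus column
    );
    """)

    db.executemany("INSERT INTO src_billing VALUES (?,?,?,?,?)",
                   [("C100", "2014-03-01", 240.0, "P1", "99213"),
                    ("C101", "2014-03-02", 980.0, "P2", "45378")])
    db.executemany("INSERT INTO src_ambulatory_emr VALUES (?,?,?,?)",
                   [(1, "Dr. Adams", "P1", "99213"),
                    (2, "Dr. Baker", "P1", "99395")])

    # Because the bus columns are consistent, querying across both source extracts
    # on CPT code needs no remodeling -- just a join on the standardized vocabulary.
    rows = db.execute("""
        SELECT b.patient_id, b.cpt_code, b.charge_amt, e.provider
        FROM src_billing b
        JOIN src_ambulatory_emr e
          ON e.patient_id = b.patient_id AND e.cpt_code = b.cpt_code
    """).fetchall()
    print(rows)   # [('P1', '99213', 240.0, 'Dr. Adams')]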

Later Binding

So, as your analytic use cases mature and expand, so will your need to bind to new vocabularies. So start off with those “Core Data Elements of Healthcare” that we talked about earlier. Then, bind as necessary to LOINC, RxNorm, SNOMED, HCPCS. Someday our source systems will bind to these for us, right there in point 1 of the binding diagram, and we’ll be able to incorporate these from the source system. Unfortunately, the EMR vendors and the pharmacy vendors and the lab vendors haven’t done a good job binding the data that they collect in those source systems, and so we end up having to do it ourselves. Hopefully that will change as we go forward in the industry.
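A simple way to picture that later binding to a standard vocabulary is a mapping table that only gets applied when an analytic use case needs it. The sketch below, in Python with hypothetical local lab codes, attaches LOINC codes to lab results at query time; anything not yet mapped simply keeps its local code until there is a reason to map it.

    # Hypothetical local lab result rows, landed as-is from a source system.
    lab_results = [
        {"patient_id": "P1", "local_code": "GLU-FAST", "value": 128},
        {"patient_id": "P2", "local_code": "HGBA1C",   "value": 7.2},
        {"patient_id": "P3", "local_code": "K-SERUM",  "value": 4.1},
    ]

    # A local-code-to-LOINC map, maintained as reference data and applied only when
    # an analytic use case needs the standard vocabulary (codes are illustrative).
    local_to_loinc = {
        "GLU-FAST": "1558-6",   # fasting glucose
        "HGBA1C":   "4548-4",   # hemoglobin A1c
    }

    # Late binding: attach the LOINC code at query time; unmapped codes stay local.
    for row in lab_results:
        row["loinc"] = local_to_loinc.get(row["local_code"])
        print(row)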

Healthcare Analytic Adoption Model

This is a diagram of our healthcare analytic adoption model and we developed this with the collaboration of a lot of wonderful colleagues and friends across the industry. And we really believe that it’s an effective roadmap for organizations to plot their own progression and adoption of analytics. But it also gives you an opportunity to assess vendors and ask them to prove how they support each of these levels of adoption. And hopefully over time this will become a tool similar to the HIMSS EMR Adoption Model, by which we can measure the adoption of analytics across the industry.

So just really quickly. From level zero up, the industry for the most part is represented by fragmented point solutions right now, and that means there are a lot of standalone analytic applications in the environment right now. It’s inefficient from a lot of perspectives, and maybe the worst thing it does is it produces inconsistent versions of the truth, and we’re trying to move away from that. And if you think about the evolution of electronic medical records and clinical systems, this is similar to the old best of breed HL7 interface environment in the past, which has been largely replaced by the single integrated systems like Cerner and Epic. And having managed fragmented best of breed clinical systems in the old days at Intermountain, and then managing Cerner and Epic, I can tell you right now I would much rather manage these integrated systems of Epic and Cerner. And likewise, that’s the same thing that we’re going to see in data warehousing. We’re gonna move away from these fragmented point solutions to a more integrated platform. And the first step above that is the creation of an enterprise data warehouse and establishing a foundation of data and technology from which you can build these other levels.

The next level up from that is overlaying standardized vocabulary and patient registries. So, starting to organize your data, relating and organizing the core data that’s in level 1. The next thing you need to do is take care of your basic automated internal reporting. So rather than the fire drill that all of us go through every month trying to meet the CFO and CEO and CMO’s reporting requirements, turn this into a very efficient, hands-free and consistent production of those reports. And it’s entirely possible to do that. That’s not a dream world, especially when you have this level 1 data to work upon.

Then the next level up is automating the very complicated external reporting that we all face right now, you know, meaningful use, Joint Commission, and the more private registries, things like STS, would also be included there. And in addition to being efficient and consistent here, we also need to be very agile because these are changing quite a lot.

At level 5, we start entering the world of differentiation around our data. So up through level 4, there’s really not a lot of differentiation across organizations. But at level 5, now this is where organizations start to differentiate themselves from being in the middle of the pack to being excellent. We start dealing with clinical effectiveness measures. We’re starting to manage populations of patients in addition to specific analytics related to the patients at the point of care. And all of this is focused around and supported by evidence-based medicine, both traditional clinical trials and also what we call quasi-experimental evidence that’s emerging from your own data.

Finally, moving up from level 5 to where we’re headed right now, with a riskier financial environment, is what we call cost per case reimbursement models and creating a data-driven culture. And so, at this level, you’re starting to take on fairly significant financial risks. This is the beginning of ACOs, and you have to create a very data-literate, data-driven culture, and there are very specific things you need to do to train and cultivate that. So adopting concepts like the Toyota production system in healthcare, making sure that virtually everyone is comfortable working with basic spreadsheets and things like that. So it’s process improvement, lean types of thinking, and then a basic ability to manipulate and interact with data.

At level 7, you’re working at cost per capita reimbursement now, so you move from cost per case to cost per capita, and you’re starting to engage in predictive analytics, and you’re taking on greater financial risk and you’re managing it proactively. You’re reaching out to patients at this ACO level and you’re identifying those patients that are high-risk before they become high-risk.

And then finally the holy grail that we’re all looking toward is reimbursement models that are focused on cost per unit of health. So now you’re not being paid to treat patients, you’re being paid not to treat patients and to keep them healthy. And we have evolved into prescriptive analytics here. So rather than predicting risks and stopping there, we’re actually generating prescriptive interventions that are data-driven, that tell the healthcare provider, based upon data, what’s best for that patient’s treatment or for maintaining that patient’s health.

So there’s an increasing complexity of data binding and vocabulary as you move up. And unfortunately, what I see a lot of times in healthcare is that we try to jump ahead; predictive analytics, for example, is all the rage right now, but we don’t have any of these basics taken care of yet in healthcare. And in fact, we don’t even have good data to support predictive analytics right now. The only use we seem to be worried about for predictive analytics right now is readmissions, and that’s a pretty sad state of affairs. You would think that we’d be thinking a little more broadly than readmissions as the only form of intervention and predictive analytics that we want to be engaged in.

Summary of Principles

So a summary of some of the principles. Delay binding as long as possible, until there is a clear analytic use case that requires it. That is, delay binding the business rules and vocabulary.

Early binding is appropriate for business rules or vocabularies that change infrequently, that are low volatility, or that the organization wants to “lock down” for consistent analytics. If you’re trying to eliminate multiple definitions of length of stay calculations, or of what constitutes a diabetic patient or a congestive heart failure patient, you wanna bind that sooner rather than later.

Late binding, in the visualization layer over on the far right, is very appropriate for “what if” scenario analysis. So letting people explore, what if we define a diabetic patient differently, what is that going to do, is an important part of the process. And knowing when to bind and unbind that is really important.

Then one other suggestion that I have here is that in the design of the data warehouse, you wanna make sure that you retain a record of the changes to vocabulary and rules binding in the data models of the data warehouse itself. So it becomes this kind of recursive library of history around your data, and what that will allow you to do is model what used to be a business or a clinical rule and compare that to the way the definition is defined today. So baking that history into the data models so you can retrace those analytic steps is a very important thing to do, and it also makes a very, very handy configuration control tool as well.
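One simple way to picture that, again as an illustrative sketch rather than a prescribed design, is an effective-dated rule history table: every change to a bound rule or vocabulary gets a new versioned row, so you can always reconstruct the definition that was in force when a given analysis was run.

    from datetime import date

    # A minimal, hypothetical "rule history" table: each change to a bound business
    # rule gets a new effective-dated row instead of overwriting the old definition.
    rule_history = [
        {"rule": "length_of_stay", "version": 1, "effective": date(2012, 1, 1),
         "definition": "discharge_date - admit_date (calendar days)"},
        {"rule": "length_of_stay", "version": 2, "effective": date(2014, 7, 1),
         "definition": "elapsed hours / 24, rounded to one decimal"},
    ]

    def rule_as_of(name, as_of):
        """Return the definition that was in force for `name` on date `as_of`."""
        versions = [r for r in rule_history if r["rule"] == name and r["effective"] <= as_of]
        return max(versions, key=lambda r: r["effective"]) if versions else None

    # Re-run an old analysis exactly as it was defined, or audit when a rule changed.
    print(rule_as_of("length_of_stay", date(2013, 6, 1))["definition"])
    print(rule_as_of("length_of_stay", date(2015, 1, 1))["definition"])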

[END OF TRANSCRIPT]