Why Predictive Modeling in Healthcare Requires a Data Warehouse

Until recently, organizations that had the ability to mine and analyze data were mostly conducting retrospective analyses.7-8 Today, as their analytic capabilities mature, a growing number of healthcare systems are adopting predictive tools. Most organizations, however, are either in the early stages of building data warehouses or are using standalone analytics for particular purposes without the infrastructure required to apply these tools on a broader scale.9


Predictive algorithms enable computers to recognize patterns in data and draw inferences from those patterns about the likelihood of particular events occurring in the future. This kind of algorithm is used in many types of activities, ranging from credit card fraud detection and search engine optimization to stock market analysis and speech recognition.

To create a predictive algorithm, developers first define a problem, gather data, and run and evaluate different models to solve the problem. Next, they select the best model and validate it. Finally, they test the model by running it against a real-world dataset.

Figure 1: The Modeling Process
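The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the process any particular vendor follows: the patient data are synthetic, the two candidate "models" are simple threshold rules, and every name (make_synthetic_patients, rule_age, rule_prior_admissions, run_modeling_process) is hypothetical.

```python
import random

def make_synthetic_patients(n, seed):
    """Build synthetic (age, prior_admissions, readmitted) rows. Illustrative only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(20, 90)
        prior_admissions = rng.randint(0, 5)
        # Assumed relationship: more prior admissions means higher readmission risk.
        risk = min(0.9, 0.1 + 0.15 * prior_admissions)
        rows.append((age, prior_admissions, rng.random() < risk))
    return rows

# Candidate models: each maps a row to a predicted outcome (True = readmitted).
def rule_age(row):
    return row[0] >= 65

def rule_prior_admissions(row):
    return row[1] >= 2

def accuracy(model, rows):
    """Fraction of rows where the prediction matches the known outcome."""
    return sum(model(r) == r[2] for r in rows) / len(rows)

def run_modeling_process(seed=0):
    # 1. Define the problem (predict readmission) and gather data.
    data = make_synthetic_patients(600, seed)
    train, validation, test = data[:300], data[300:450], data[450:]

    # 2. Run and evaluate different models.
    candidates = {"age>=65": rule_age, "prior_admissions>=2": rule_prior_admissions}
    train_scores = {name: accuracy(m, train) for name, m in candidates.items()}

    # 3. Select the best model and validate it on data it has not seen.
    best = max(train_scores, key=train_scores.get)
    validation_accuracy = accuracy(candidates[best], validation)

    # 4. Test the chosen model on a final held-out set, standing in here
    #    for the real-world dataset.
    test_accuracy = accuracy(candidates[best], test)
    return {
        "best": best,
        "train_scores": train_scores,
        "validation_accuracy": validation_accuracy,
        "test_accuracy": test_accuracy,
    }

if __name__ == "__main__":
    print(run_modeling_process())
```

Keeping the validation and test sets separate from the data used to compare candidates is the point of the sketch: a model that only looks good on the data it was chosen on has not yet been shown to predict anything.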

To improve the accuracy of predictive modeling, developers may take an approach known as “supervised learning,” in which the outcome is known ahead of time and is used to “train” an algorithm. But in healthcare, many important kinds of patient outcomes are not captured as structured data. Without outcomes data to train the algorithm, it’s difficult to apply a supervised learning model.

Some outcomes of interest can readily be measured. For example, if a predictor is designed to identify the patients who are at the highest risk of being readmitted, the outcome is a readmission within a certain period of time. Similarly, if an algorithm predicts which patients are most likely to have out-of-control hypertension or to be noncompliant with particular medications, those endpoints may be documented in structured data that can be analyzed.
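As a rough illustration of how a readmission outcome might be derived from structured encounter data, the sketch below labels each hospital stay according to whether the same patient was admitted again within a fixed window after discharge. The 30-day window, the tuple layout, and the function name label_readmissions are assumptions made for the example; the text specifies only "a certain period of time."

```python
from datetime import date

def label_readmissions(stays, window_days=30):
    """Return one boolean per stay, in input order.

    stays: list of (patient_id, admit_date, discharge_date) tuples.
    A stay is labeled True when the same patient's next admission
    begins within window_days of this stay's discharge.
    """
    by_patient = {}
    for index, (patient_id, admit, discharge) in enumerate(stays):
        by_patient.setdefault(patient_id, []).append((admit, discharge, index))

    labels = [False] * len(stays)
    for visits in by_patient.values():
        visits.sort()  # chronological order by admission date
        # Pair each stay with the same patient's next stay.
        for (_, discharge, index), (next_admit, _, _) in zip(visits, visits[1:]):
            gap_days = (next_admit - discharge).days
            labels[index] = 0 <= gap_days <= window_days
    return labels

stays = [
    ("A", date(2024, 1, 1), date(2024, 1, 5)),
    ("A", date(2024, 1, 20), date(2024, 1, 25)),  # 15 days after first discharge
    ("B", date(2024, 1, 1), date(2024, 1, 3)),
    ("A", date(2024, 6, 1), date(2024, 6, 4)),    # months later
]
labels = label_readmissions(stays)
```

Labels produced this way are the known outcomes that a supervised learning model would be trained against.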

In contrast, the health status of patients after discharge may not be available unless patients fill out functional status surveys at specified intervals. Also, follow-up on most patients after discharge or between office visits is limited or nonexistent. As a result, only the data generated in the EHR during a visit or an episode of care may be available. Diagnoses, lab values, medications, and vital signs from these encounters appear in a data warehouse, but they say nothing about how the patient fared between visits or episodes.

Even the episodic data are frequently not structured in the EHR, partly because some providers don’t enter them in the ubiquitous pull-down menus and check boxes. For example, studies have shown that patient diagnoses are often missing from discrete data, although they usually appear somewhere else in the record.10

Paid claims data, in contrast, always include diagnostic and treatment codes. Moreover, claims data show the services and prescriptions that patients received from providers outside an organization or network. But claims have a built-in lag time, which limits their value for predicting what might happen in the near future. Furthermore, claims are not precise enough to describe in detail what has been done for the patient in various care settings.

For most analytic purposes, organizations rely on a combination of clinical and claims data, if they have access to the latter. ACOs, in particular, are expected to depend on claims data for years to come.11 But to make the best use of predictive analytics, healthcare organizations must build data warehouses capable of aggregating, normalizing, and cleaning these data and presenting them in a format that is easy to use in report generation.
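To make "aggregating, normalizing, and cleaning" concrete, here is a deliberately small sketch that combines diagnosis codes from two sources into one view per patient. It is not a description of any real warehouse: the field names (patient_id, dx), the source labels, and the normalization rule (uppercase, strip whitespace and dots) are all assumptions chosen for the example, and real code normalization is considerably more involved.

```python
def normalize_code(code):
    """Normalize a diagnosis code so that 'i50.9' and 'I509' compare equal.

    Assumed rule for this sketch: trim whitespace, uppercase, drop dots.
    """
    return code.strip().upper().replace(".", "")

def build_patient_view(ehr_rows, claim_rows):
    """Aggregate diagnosis codes per patient, tracking which source reported each.

    Each row is a dict with hypothetical keys 'patient_id' and 'dx'.
    Returns {patient_id: {normalized_code: set_of_sources}}.
    """
    patients = {}
    for source, rows in (("ehr", ehr_rows), ("claims", claim_rows)):
        for row in rows:
            code = normalize_code(row["dx"])
            if not code:
                continue  # cleaning step: skip blank diagnosis entries
            by_code = patients.setdefault(row["patient_id"], {})
            by_code.setdefault(code, set()).add(source)
    return patients

ehr_rows = [
    {"patient_id": "p1", "dx": "i50.9"},
    {"patient_id": "p2", "dx": "  "},      # blank entry, dropped
]
claim_rows = [
    {"patient_id": "p1", "dx": "I509"},    # same diagnosis, different formatting
    {"patient_id": "p1", "dx": "E11.9"},   # appears only in claims
]
view = build_patient_view(ehr_rows, claim_rows)
```

Tracking the source of each code shows at a glance which diagnoses appear only in claims, which is one way the two data types complement each other.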

Specificity and clinical insight

Predictive modeling is more accurate when it is applied to specific subpopulations and care settings than when it is used generically across cohorts and organizations. A generic readmission predictor developed in-house, for example, was validated to perform with a 79 percent positive predictive value (PPV). In contrast, a readmission predictor developed by Health Catalyst and applied to patients with congestive heart failure has a 91 percent PPV. The latter model is more accurate because the variables it uses are more specific to the population involved. In other words, the very features that characterize a specific condition well are the same attributes that can train an accurate predictor. Additional information is available at https://pages.healthcatalyst.com/rs/healthcatalyst/images/9.24.2013.DavidCrockettPredictiveAnalyticsWebinar.pdf.

Figure 2: Specific Improves Accuracy
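For readers unfamiliar with the metric, positive predictive value is the share of positive predictions that turn out to be correct: true positives divided by the sum of true positives and false positives. The sketch below computes it from paired predictions and outcomes. The sample lists are invented for illustration and are unrelated to the 79 and 91 percent figures cited above; the function name is likewise hypothetical.

```python
def positive_predictive_value(predicted, actual):
    """PPV = true positives / (true positives + false positives).

    predicted, actual: equal-length sequences of truthy/falsy values,
    where a truthy value means "readmitted".
    """
    pairs = list(zip(predicted, actual))
    true_positives = sum(1 for p, a in pairs if p and a)
    false_positives = sum(1 for p, a in pairs if p and not a)
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when nothing is predicted positive")
    return true_positives / flagged

# Invented example: four patients flagged, three of them actually readmitted.
predicted = [True, True, True, False, True]
actual = [True, False, True, True, True]
ppv = positive_predictive_value(predicted, actual)  # 3 / (3 + 1) = 0.75
```

Note that PPV ignores patients the model failed to flag (the fourth patient above), so it is usually read alongside a measure of sensitivity.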

A study by researchers at Emory University makes the same point in a different way. The researchers used an algorithm to predict readmission of post-surgical patients to a children’s hospital based on three variables: how many days a patient had been in the hospital, whether or not the patient had failed to thrive during the pre-operative period, and whether or not the patient was Hispanic. Researchers found that these indicators predicted most readmissions to that hospital.12 However, this algorithm could only be applied to areas with many Spanish speakers, who are less likely to understand discharge instructions spoken or written in English.

Ironically, any physician or nurse would recognize this language barrier even without sophisticated predictive analytics. Similarly, they know that low patient literacy, poor understanding of discharge instructions, failure or inability to make an appointment with a primary care doctor, and lack of communication between inpatient and ambulatory providers are all factors in readmissions.13-14 What is needed to solve these problems is not analytics, but action grounded in clinical experience.

Even where predictive analytics can help improve the quality of care, clinical insight is critically important to support and inform the use of these tools. Unfortunately, that insight isn’t always available. For example, Northwestern University found that 30 percent of its own patients under chronic condition management were unable to participate in treatment protocols. The reasons were related to cognitive, economic, physical or geographic inabilities, religious beliefs, contraindications to the protocol, and/or voluntary non-compliance. These atypical patients must be treated or reached in a unique way, and predictive algorithms, data collection strategies, and interventions must be adjusted for their attributes. More information is available at www.healthcatalyst.com/webinar/predictive-analytics-its-the-intervention-that-matters/.

Clinical observations can also improve the accuracy of predictors. To illustrate, a patient wellness metric known as the Rothman index requires users to input not only structured data such as lab values and blood pressure readings but also the nursing assessment of the patient.15 The predictor would fail without the nursing notes, because it would offer only an incomplete snapshot of the patient. But the combination of the nursing assessment with the lab values and the vitals makes the Rothman index fairly accurate.

Predictive analytics, as noted earlier, is not very useful unless it can be applied to patient care to improve outcomes and efficiency. These tools, which might better be described as “prescriptive” analytics, should link predictions to specific clinical priorities, such as increasing the percentage of hypertensive patients who have their blood pressure controlled. The predictors should also be focused on measurable events, such as cost effectiveness, clinical protocols or patient outcomes.

To use these tools successfully, healthcare organizations must be willing to change their culture and their work processes. New workflows should be set up, and organizations should deploy automation solutions to take advantage of the insights afforded by predictive modeling. Organizations must persuade clinicians to trust analytics that have been proven valid for particular kinds of clinical decisions.

Because of the potentially serious consequences of making the wrong decision in an emergency, predictive analytics are sometimes easier to apply to slowly changing situations such as chronic disease management, elective procedures, weaning patients off ventilators, and antibiotic protocols.

For optimal use in chronic disease management, predictive analytics should be applied to longitudinal rather than episodic data. This requires
