In Healthcare Predictive Analytics, Big Data Is Sometimes a Big Mess
Those in Big Data and Healthcare Analytics circles seldom hear the phrase “less is more.” In a clinical setting, however, there is an important lesson here for the effective execution of predictive analytics: we should not confuse more data with more insight. More data is simply more—as in more tables, more lists, more replicates, more clinics, more controls, more rows, tables of tables and lists of lists. You get the idea. In short, for predictive analytics to be effective in a clinical venue, a specific focus will always trump global utility, for two reasons. First, specificity improves the performance and accuracy of the algorithm. Second, it improves the effectiveness of any associated intervention.
Successful Predictive Analytics in Healthcare: 4 Reasons Why Predictive Analytics Does Not Depend on Big Data
The key to successful predictive analytics implementation lies much more in upfront planning than in harnessing big data. The work actually begins well upstream of the predictor and its implementation.
First, it is essential to accurately model the workflow and detail the specific question you want the computer to address.
Next, collect the necessary data specific to and characteristic of that problem space. Gathering data is often divided into three areas:
1) What is known about the specific patient?
2) What is known about that population?
3) What supplementary data can be leveraged from external and public sources?
The goal here is to stay specific to the original question at hand. Using the computer to help with feature selection can be especially useful during this step.
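A minimal sketch of what computer-assisted feature selection can look like at this step: a simple filter that ranks candidate features by how strongly they separate the two outcome groups. The feature names and toy data below are hypothetical, and the standardized mean difference is just one of many possible scoring functions.

```python
# Filter-style feature ranking: score each feature by how well it separates
# the two outcome groups, then keep the top-ranked features for the model.
from statistics import mean, stdev

def rank_features(records, labels, feature_names):
    """Score each feature by |mean(group1) - mean(group0)| / pooled stdev."""
    scores = {}
    for j, name in enumerate(feature_names):
        g0 = [r[j] for r, y in zip(records, labels) if y == 0]
        g1 = [r[j] for r, y in zip(records, labels) if y == 1]
        pooled = (stdev(g0) + stdev(g1)) / 2 or 1.0  # guard against zero spread
        scores[name] = abs(mean(g1) - mean(g0)) / pooled
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: [age, lab_value, prior_admissions] per patient
records = [[62, 1.1, 0], [70, 1.0, 1], [65, 1.2, 0], [58, 0.9, 1],
           [61, 4.8, 4], [69, 5.1, 5], [66, 4.9, 4], [59, 5.2, 6]]
labels  = [0, 0, 0, 0, 1, 1, 1, 1]
print(rank_features(records, labels, ["age", "lab_value", "prior_admissions"]))
# → ['lab_value', 'prior_admissions', 'age']
```

Here the lab value rises to the top because it differs sharply between outcome groups, while age, which is nearly identical across groups, falls to the bottom and can be dropped.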
Third, recognize the weaknesses and leverage the strengths of the various algorithmic approaches. For example, linear regression is typically used for continuous outcomes, while logistic regression handles categorical/discrete outcomes. Naive Bayes deals with missing data much better than many other approaches. Support vector machines are powerful non-linear classifiers (with excellent performance in binary classification), but they are computationally demanding to train and run, and they are sensitive to noisy data. Notably, “human readable” classifiers such as rule-based models or regressions can be implemented directly in many SQL reporting environments, while “machine readable” approaches (e.g. random forests, neural nets) may require additional programming.
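To make the Naive Bayes point concrete: because the model treats features as conditionally independent, a missing value can simply be skipped in the likelihood product rather than imputed. The sketch below uses Gaussian likelihoods and hypothetical vitals data (heart rate and lactate) purely for illustration.

```python
# Gaussian Naive Bayes that tolerates missing features (None values)
# by skipping them in the per-class likelihood product.
import math
from statistics import mean, variance

def train(records, labels):
    model = {}
    for c in set(labels):
        rows = [r for r, y in zip(records, labels) if y == c]
        cols = list(zip(*rows))
        model[c] = (len(rows) / len(records),               # class prior
                    [(mean(col), variance(col)) for col in cols])
    return model

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(model, record):
    best, best_p = None, -1.0
    for c, (prior, params) in model.items():
        p = prior
        for x, (mu, var) in zip(record, params):
            if x is not None:          # missing feature: skip, don't impute
                p *= gauss(x, mu, var)
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical training data: [heart_rate, lactate]; label 1 = deteriorating
records = [[72, 1.0], [75, 1.2], [70, 0.9], [118, 4.0], [122, 4.4], [115, 3.8]]
labels  = [0, 0, 0, 1, 1, 1]
model = train(records, labels)
print(predict(model, [120, None]))   # lactate missing, still classified → 1
```

A rule-based or regression model would need the missing lactate filled in before scoring; here the remaining features carry the prediction on their own.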
And finally, but perhaps most importantly, find the appropriate clinical group and environment for implementation. Prediction should not be done for prediction's sake. The not-so-obvious irony is that without the proper framework in place, a willingness to intervene, and a context for meaningful use, prediction is really not very useful. In fact, it is often a waste of time and money.
Don’t Trade Utility for Big Data Hype
With so much hype surrounding market buzzwords such as Big Data and Predictive Analytics, it can be daunting for healthcare organizations to sort through all the noise in this space. One guiding principle: do not trade useful for sexy. In healthcare, the trade-off of a more generalized prediction model that takes in Big Data and global features is that targeted utility is lost or diluted. The very features that characterize a condition well are the attributes that can train an accurate predictor, but if those variables do not stand out above the background noise, the predictor learns the noise instead. For this reason, a prediction focused on a specific clinical setting or patient need will always trump a generic predictor in accuracy and utility.
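The dilution effect described above can be demonstrated with entirely synthetic data: a 1-nearest-neighbor classifier trained on one genuinely predictive feature, versus the same feature buried among 50 irrelevant “Big Data” features. The data, dimensions, and classifier choice are all illustrative assumptions.

```python
# Demonstration: irrelevant features drown out a real signal for 1-NN.
import random

random.seed(0)
NOISE_DIMS = 50

def make_record(label, noisy):
    signal = [random.gauss(3.0 * label, 1.0)]      # the one feature that matters
    noise = [random.gauss(0.0, 1.0) for _ in range(NOISE_DIMS)] if noisy else []
    return signal + noise

def one_nn_accuracy(noisy, n_train=100, n_test=100):
    train = [(make_record(y, noisy), y) for y in (0, 1) for _ in range(n_train)]
    test = [(make_record(y, noisy), y) for y in (0, 1) for _ in range(n_test)]
    hits = 0
    for x, y in test:
        # classify by the label of the closest training record
        _, pred = min(train, key=lambda t: sum((a - b) ** 2
                                               for a, b in zip(t[0], x)))
        hits += pred == y
    return hits / len(test)

focused = one_nn_accuracy(noisy=False)
diluted = one_nn_accuracy(noisy=True)
print(f"signal only: {focused:.2f}, with noise features: {diluted:.2f}")
```

With only the informative feature, accuracy is high; once the same signal is spread across dozens of uninformative dimensions, the distance calculation is dominated by noise and accuracy slides toward chance, which is exactly the trade-off a generic, everything-in predictor makes.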
The full power of clinical prediction is best realized when the computational question is carefully defined, specific variables are gathered, a targeted need is met, and participants are willing to act.
For predictive analytics, it’s the intervention that matters most. After all, it’s the intervention, not the predictor, that will improve patient care.
What experiences have you had with Big Data in healthcare? Have your predictor sets found a lot of noise or have you found variables that seem to work well?