Hadoop in Healthcare: A No-nonsense Q and A
Editor’s Note: A version of this article appeared at HITECH Answers under the title Much Hadoop About Something.
Hadoop powers all kinds of applications at companies like Facebook and LinkedIn. The potential for big data and Hadoop in healthcare and healthcare data management is exciting but has not yet been fully realized. This is partly because Hadoop is not well understood in the healthcare industry and partly because healthcare doesn’t yet generate the huge quantities of data seen in other industries that would require Hadoop-level processing power.
Although healthcare analytics haven’t yet been hampered by hospital systems not using Hadoop, it never hurts to look forward and consider the possibilities. And what possibilities there are! Hadoop is an indispensable tool for efficiently storing and processing large quantities of data. Its unique capabilities will offer new ways of thinking about how we use healthcare data and analytics to provide improved patient care at reduced costs. This is a discussion we should start now, and as a starting point for this discussion, here is a Q & A on Hadoop and its implications for the future of healthcare.
1. First and foremost, what is Hadoop?
When people talk about Hadoop, they can mean a couple of different things, which often makes it confusing. At its core, Hadoop is an open-source distributed data storage and processing platform, developed largely at Yahoo! and based on research papers published by Google. Hadoop implements Google’s MapReduce model by splitting a large job into many parts, sending those parts to many different processing nodes, and then combining the results from each node.
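To make that split-process-combine pattern concrete, here is a minimal word-counting sketch in plain Python. It is illustrative only and uses none of Hadoop’s actual APIs; in a real cluster, the map and reduce tasks would run in parallel on many machines.

```python
from collections import defaultdict

def map_task(document):
    # Map phase: emit (word, 1) pairs for one chunk of input.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle phase: group intermediate values by key, as the
    # framework does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    # Reduce phase: combine all values emitted for one key.
    return key, sum(values)

def run_job(documents):
    intermediate = []
    for doc in documents:  # in Hadoop, these map tasks run on many nodes
        intermediate.extend(map_task(doc))
    return dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())

counts = run_job(["hadoop stores data", "hadoop processes data"])
```

The same three-phase shape (map, shuffle, reduce) underlies everything Hadoop runs, whether the job is counting words or aggregating billions of log records.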
Hadoop also refers to the ecosystem of tools and software that works with and enhances the core storage and processing components:
- Hive – a SQL-like query language for Hadoop
- Pig – a high-level data flow language for MapReduce
- HBase – a columnar data store that runs on top of HDFS, the Hadoop distributed file system
- Spark – a general-purpose cluster computing framework
Unlike many data management tools, Hadoop was designed from the beginning as a distributed processing and storage platform. This moves primary responsibility for dealing with hardware failure into the software, optimizing Hadoop for use on large clusters of commodity hardware.
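A rough sketch of what “failure handled in software” means: the scheduler simply retries a task on another node that holds a replica of the data. The node names and the failure below are invented for illustration; real Hadoop scheduling is far more sophisticated.

```python
def run_task(task, node, failed_nodes):
    # Simulate running one task on one node; nodes in failed_nodes crash.
    if node in failed_nodes:
        raise RuntimeError(f"node {node} failed")
    return f"{task} done on {node}"

def schedule(task, replica_nodes, failed_nodes):
    # The software, not the hardware, absorbs the failure: try each node
    # holding a data replica until one succeeds.
    for node in replica_nodes:
        try:
            return run_task(task, node, failed_nodes)
        except RuntimeError:
            continue  # reschedule on the next replica location
    raise RuntimeError("all replicas failed")

result = schedule("map-0001", ["node-a", "node-b", "node-c"],
                  failed_nodes={"node-a"})
```

Because recovery is this cheap, individual machines are allowed to be unreliable, which is exactly what makes commodity hardware viable.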
2. Yahoo!, of course, runs applications using Hadoop, as do Facebook and many of our largest tech companies. What are some of the key reasons for the platform’s adoption?
Large companies have rapidly adopted Hadoop for two reasons: enormous data sets and cost. Most have data sets that are simply too large for traditional database management applications. In the summer of 2011, Eric Baldeschwieler, formerly VP of Hadoop engineering at Yahoo! and then CEO of Hortonworks (a company that provides commercial support for Hadoop), said that Yahoo! had 42,000 nodes in several different Hadoop clusters with a combined capacity of about 200 petabytes (200,000 terabytes).
Even if existing database applications could accommodate these large data sets, the cost of typical enterprise hardware and disk storage becomes prohibitive. Hadoop was designed from the beginning to run on commodity hardware with frequent failures. This substantially reduces the need for expensive hardware infrastructure to host a Hadoop cluster. Because Hadoop is open source, there are no licensing fees for the software either, another substantial savings.
It’s important to note that both Yahoo! and Facebook started with traditional database management systems and didn’t turn to Hadoop until they had no more scaling options.
3. How will Hadoop impact and/or change healthcare analytics?
In February of this year, HIMSS Journal released a report on big data, “Big data analytics in healthcare: promise and potential.” In the report, the authors list Hadoop as the most significant data processing platform for big data analytics in healthcare.
Using Hadoop, researchers can now use data sets that were traditionally impossible to handle. A team in Colorado is correlating air quality data with asthma admissions. Life sciences companies use genomic and proteomic data to speed drug development. The Hadoop data processing and storage platform opens up entire new research domains for discovery. Computers are great at finding correlations in data sets with many variables, a task for which humans are ill-suited.
However, for most healthcare providers, the data processing platform is not the real problem, and most healthcare providers don’t have “big data.” A hospital CIO I know plans for future storage growth by estimating 100MB of data generated per patient, per year. A large 600-bed hospital can keep a 20-year data history in a couple hundred terabytes.
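That back-of-the-envelope estimate is easy to check. The patient volume below is an assumption added here for illustration; the article gives only the 100MB-per-patient-per-year rule of thumb and the couple-hundred-terabyte conclusion.

```python
# Rough storage estimate for a large hospital, using the CIO's rule of thumb.
MB_PER_PATIENT_PER_YEAR = 100
PATIENTS_PER_YEAR = 100_000  # assumed volume for a large 600-bed hospital
YEARS_OF_HISTORY = 20

total_mb = MB_PER_PATIENT_PER_YEAR * PATIENTS_PER_YEAR * YEARS_OF_HISTORY
total_tb = total_mb / 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
```

Under these assumptions the 20-year archive comes to roughly 200 terabytes, well within the reach of a conventional storage array and nowhere near the scale that motivated Hadoop.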
Every day, there are more than 4.75 billion content items shared on Facebook (including status updates, wall posts, photos, videos, and comments), more than 4.5 billion “Likes,” and more than 10 billion messages sent. More than 250 billion photos have been uploaded to Facebook, and more than 350 million photos are uploaded every day on average. Facebook adds 500 terabytes a day to their Hadoop warehouse.
Southwest’s fleet of 607 Boeing 737 aircraft generates 262,224 terabytes of data every day. They don’t store it all (yet), but the planes’ instrumentation produces that much data.
Healthcare analytics is generally not being held back by the capability of the data processing platforms. There are a few exceptions in the life sciences, and genomics provides another interesting use case for big data. But for most healthcare providers, the limiting factor is our willingness and ability to let data inform and change the way we deliver care. Today, it takes more than a decade for compelling clinical evidence to become common clinical practice. We have known for a long time that babies born at 37 weeks are twice as likely to die from complications like pneumonia and respiratory distress as those born at 39 weeks. Yet 8 percent of births are non-medically necessary pre-term deliveries (i.e., before 39 weeks).
The problem we should be talking about in healthcare analytics is not what the latest data processing platform can do for us. We should be talking about how we can use data to engage clinicians and help them provide higher-quality care. It’s not how much data you have that matters, but how you use it. At our upcoming September Healthcare Analytics Summit, national experts and healthcare executives will lead an interactive discussion on how healthcare analytics has gone from a “nice to have” to a “must have” in supporting the requirements of healthcare transformation.
4. Keying in on that statement of usage, how would clinicians use data sources outside their own environment?
Data from other clinical providers in your geography can be very useful. Claims data give a broad picture but not a deep one. Data from non-traditional sources also have surprising relevance; in some cases, they are better predictors than clinical data. For example:
- EPA data on geographical toxic chemical load add insight to cancer rates for long-term residents.
- The CMS-HCC risk adjustment model can help providers understand why patients in their area seem to have higher or lower risk for certain disease conditions.
- A household size of one increases the risk of readmission because there is no other caregiver in the home.
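As a purely hypothetical illustration of how a non-clinical signal might feed a risk score, the sketch below adds a household-size adjustment to a baseline clinical risk. The weights and function are invented for this example; they are not drawn from CMS-HCC or any real model, which would be fit to actual outcome data.

```python
def readmission_risk(base_clinical_risk, household_size):
    # Hypothetical adjustment: living alone (no caregiver at home)
    # bumps the readmission risk. The 0.15 weight is invented.
    lives_alone_bump = 0.15 if household_size == 1 else 0.0
    return min(1.0, base_clinical_risk + lives_alone_bump)

risk_alone = readmission_risk(0.30, household_size=1)   # no caregiver at home
risk_family = readmission_risk(0.30, household_size=3)  # caregiver available
```

The point is not the particular numbers but that a field from outside the EHR can move a prediction, which is why these non-traditional sources are worth integrating.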
5. What are the drawbacks of Hadoop and what do CTOs, CIOs and other IT leaders need to consider?
Compared with typical enterprise infrastructure, Hadoop is a very young technology, and its capabilities and tools are relatively immature.
So too is the pool of people with deep Hadoop experience; the only people with ten years of it are the handful of engineers who created the platform. If Hadoop solves a data analysis problem for your organization, make sure you plan for enough skilled people to deploy it, manage it, and query data from it. Remember that your competition for these resources will be large technology and financial services companies, and people with Hadoop experience are in high demand.
You will probably also need to consider an alternative hardware maintenance approach. Hadoop was designed for commodity hardware, with its attendant higher failure rates. Instead of purchasing maintenance on the hardware and having someone else come fix or replace it when it breaks, you should plan to have spare nodes sitting in the closet, or even racked up in the data center. Deploying Hadoop on expensive enterprise hardware with SAN-based disk and 24×7 maintenance coverage undermines the value proposition of the technology.
The good news is that the commercial database vendors, including Microsoft, Oracle, and Teradata, are all racing to integrate Hadoop into their offerings. These integrations will make it much easier to utilize Hadoop’s unique capabilities while leveraging existing infrastructure and data assets.
6. Finally, where is the development of this platform going and what will be its ongoing impact on big data?
Fifteen years ago, we didn’t capture data unless we knew we needed it. The cost to capture and store it was just too high. Fifteen years from now, reductions in the cost to capture and store data will likely mean that we will capture and store everything. Hadoop is a huge leap forward in our ability to efficiently store and process large quantities of data. This allows more people to spend more time thinking about interesting questions and how to apply the resulting answers in a meaningful and useful way.
Would you like to use or share these concepts? Download this Why Healthcare Data Warehouses Fail presentation highlighting the key points.