How to Use Text Analytics in Healthcare to Improve Outcomes—Why You Need More than NLP

“Eighty percent of clinical data is locked away in unstructured physician notes that can’t be read by an EHR and so can’t be accessed by advanced decision support and quality improvement applications,” according to Peter J. Embi, MD, MS, President and CEO of Regenstrief Institute. Unfortunately, more than 95 percent of health systems can’t utilize this valuable clinical data because the analytics needed to access it are difficult and expensive and require an advanced technical skillset and infrastructure.

For example, a team of analysts in Indiana set out to identify peripheral arterial disease (PAD) patients across two health systems. The team hit a roadblock when they discovered that structured EMR and claims data failed to identify over 75 percent of patients with PAD. To better understand high-risk populations, such as PAD patients, health systems must leverage text analytics, including text search refined with natural language processing (NLP). Doing so is key to the success of their clinical and financial strategies as they continue to take on more financial risk. So far, however, most health systems’ use of text analytics has been limited to research departments within academic medical centers.

This executive report explains why text analytics in healthcare is important in all areas of the industry—not just research—and demonstrates how, despite resource and infrastructure challenges, health systems can leverage it. It also describes the four critical components of text analytics, from optimizing text search to pragmatically integrating text analytics system-wide.

The Current State of Text Analytics in Healthcare

Health systems simply aren’t leveraging text analytics—Gartner estimated in July 2016 that fewer than 5 percent do. Most systems manually curate patient registries and rely only on coded data—an approach that significantly limits their understanding of patient populations.

Picking back up on the search for patients with PAD from above, let’s understand why the analytics team was looking at this high-risk cohort. PAD is a condition in which narrowed arteries reduce blood flow to the limbs; it affects more than 3 million patients every year in the United States. Patients with PAD are at high risk for coronary heart disease, heart attack, and stroke. This risk translates both to sicker patients and to financial cost for health systems taking on more value-based care contracting arrangements.

At first, the team in Indiana identified fewer than 10,000 patients with PAD using a traditional approach (querying the ICD and CPT codes for PAD). Hoping for a more complete patient list, the team tasked a sophisticated NLP group with writing algorithms to identify more PAD patients. Integrating text analytics led to the discovery of over 41,000 PAD patients, more than four times the number found using traditional methods limited to codified diagnoses and procedures.

There is a clear need to leverage unstructured text data in healthcare analytics despite the limited use today.

The Typical Text Analytics Scenario

In the typical text analytics scenario, a data scientist writes an NLP algorithm to determine the number of PAD patients, validates the algorithm, and returns the result to the investigator. In this scenario, only two people in the health system understand the PAD patient population. Systems need more than just a handful of specialized algorithms serving small groups of stakeholders.

Figure 1: The Typical Text Analytics Scenario

A Better Text Analytics Scenario

In a much better text analytics scenario, the data scientist does more than create and validate an algorithm. The scientist deploys the algorithm to the system’s analytics environment to be run on a nightly basis, allowing any authorized analytics user to make use of the results. The algorithm output is combined with coded data to create a precise PAD registry. In this improved scenario, the health system knows how many PAD patients it has, how many new patients came into the system, and what the care gaps are. Systems need to make sure large groups of non-technical stakeholders can use text analytics to start their own queries and leverage the work of system-wide staff.

Figure 2: A Better Text Analytics Scenario

Now, let’s dive into how text search works in the real world.

Google: The Text Search Gold Standard

When it comes to text search, Google is the gold standard. People overwhelmingly find Google easy to use, fast, and accurate. It finishes our sentences for us and, most of the time, gives us exactly what we’re looking for. Even though Google search feels simple and straightforward, it’s a lot more complicated than it seems: Google analyzes every web page clicked, looks at other web pages that link to those pages, and examines synonyms too. Behind the simple search box is a sophisticated text analytics machine for finding information in web pages online. What if finding healthcare data in medical records were as simple as using Google? Let’s look at what it would take to build “Google for the Medical Record.”

How Text Search Works

At the core of how text search works is an inverted index—an index, similar to the one at the end of a book, that lists the words in a text and where they appear. Search engines index documents: they read each document, break it into the individual words it contains, and create a sorted list of all the words.

Figure 3 features three simple clinical documents, starting with document 0. Document 0 states that the patient is a 67-year-old female with NIDDM (noninsulin-dependent diabetes mellitus) and hypertension; document 1 states that the patient has no diabetes or hypertension; and document 2 states that the patient’s mother and sister are diabetic.

The second column in the inverted index (“document”) indicates which document each word appears in. The third column (“inverted index”) indicates the position of the word in the document. For example, the word “old” appears in document 0 in the fifth position. The term “diabet” also appears, which is a word stem. Words that tend to have a high level of variance are reduced to word stems (a simpler version of the word) to broaden search capacity. In this example, “diabet” is mapped to documents 1 and 2.

Figure 3: The Inverted Index
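The indexing scheme above can be sketched in a few lines of Python. This is an illustrative toy, assuming regex tokenization and a crude suffix-stripping stemmer; production engines such as Lucene use far more sophisticated analyzers:

```python
import re
from collections import defaultdict

def stem(word):
    """Crude suffix stripper: maps 'diabetes' and 'diabetic' to 'diabet'."""
    for suffix in ("es", "ic", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def build_inverted_index(documents):
    """Map each stemmed term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[stem(token)].append((doc_id, position))
    return index

docs = [
    "Patient is a 67 year old female with NIDDM and hypertension.",
    "Patient has no diabetes or hypertension.",
    "Patient's mother and sister are diabetic.",
]
index = build_inverted_index(docs)
print(index["diabet"])  # [(1, 3), (2, 6)] -- postings for 'diabetes' and 'diabetic'
print(index["old"])     # [(0, 5)]
```

Storing positions, not just document IDs, is what lets a search engine support phrase queries and proximity ranking on top of simple term lookup.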

Three Great Text Search Tools for Indexing Text and Providing Search Capability

Health systems can set up their own inverted indexes using one of several open source tools:

Tool #1: Lucene

First released in 1999, Lucene is a Java-based API that provides the foundation for more advanced search engine capabilities. Lucene creates and maintains the search index and handles hit ranking and result sorting. Most users access Lucene through one of two technologies: Solr and Elasticsearch.

Tool #2: Solr

Solr is an open source, Java-based enterprise search tool which pioneered advances on top of Lucene starting in 2004. Health systems can use any programming language to develop with Solr, which makes it easy to create an inverted index, simple web page, and search interface like Google. Many leading technology organizations use Solr, such as Apple and NASA.

Tool #3: Elasticsearch

Elasticsearch is a newer open source enterprise search tool, first released in 2010. Like Solr, Elasticsearch uses Lucene as a foundation but has built its own set of APIs and functionality on top of the inverted index. Netflix and Facebook use Elasticsearch.

Three Text Search Must-Haves: Display, Medical Terminologies, and Context

As demonstrated in Figure 4, the search for diabetes surfaced two indexed documents. Using word stemming, the search found both diabetes and diabetic. But there are two problems with this search: it missed the mention of NIDDM (a synonym) and neither result is relevant to a medical cohort query for diabetics.

Figure 4: Text Search Example

Although the simple, familiar interface and fast results (generated by the inverted index) worked well, three text search must-haves for medical searches—display, medical terminologies, and context—are missing.

Must-Have #1: Optimize Results Display for Use Cases

We must create a solution that matches the needs of those using the data. For example, users of data may not have permission to view PHI. If we aggregate the results of the text analysis, then we provide users with access to the key results they need without compromising privacy.

Must-Have #2: Expand Search with Medical Terminologies

The diabetes search in Figure 4 didn’t include NIDDM, a synonym of diabetes. Health systems will produce more sophisticated analyses by leveraging medical terminologies to automatically expand synonym lists. The best search engines suggest terms that match user queries as they type. Providing a quick and easy way for users to identify synonyms valuable to their search will increase the quantity and accuracy of the results.
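Terminology-driven query expansion can be sketched with a small, hand-curated synonym map. This is a hypothetical dictionary for illustration only; a production system would load synonyms from a medical terminology service such as UMLS or SNOMED CT:

```python
# Hypothetical, hand-curated terminology map; real systems would load
# synonyms from a medical terminology service (e.g., UMLS, SNOMED CT).
TERMINOLOGY = {
    "diabetes": ["diabetes", "diabetic", "NIDDM", "IDDM", "DM type 2"],
    "hypertension": ["hypertension", "HTN", "high blood pressure"],
}

def expand_query(term):
    """Return the user's term plus any known synonyms, deduplicated."""
    synonyms = TERMINOLOGY.get(term.lower(), [])
    seen = []
    for candidate in [term] + synonyms:
        if candidate.lower() not in [s.lower() for s in seen]:
            seen.append(candidate)
    return seen

print(expand_query("diabetes"))
# ['diabetes', 'diabetic', 'NIDDM', 'IDDM', 'DM type 2']
```

A search for “diabetes” expanded this way would now match the mention of NIDDM that the Figure 4 search missed.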

Must-Have #3: Refine Results with Context

Context matters. To be useful for clinical applications (e.g., searching for genotype/phenotype correlations), retrieving patients eligible for a clinical trial, or identifying disease outbreaks, simply identifying clinical conditions in the text is insufficient—information described in the context of the clinical condition is critical for understanding the patient’s state.

If we are searching for patients with pneumonia, then searching for “pneumonia” without considering the context would result in identifying the following types of phrases: “ruled out pneumonia,” “history of pneumonia,” and “family history of pneumonia.” None of these indicate the patient currently has pneumonia.

ConText, an NLP pattern matching algorithm published in 2009, addresses these scenarios by enhancing text search and refining results: it detects conditions and determines whether they are negated, historical, or experienced by someone else. In Figure 5 below, the ConText algorithm analyzes the sentence, “No history of chest tightness but family history of CHF” for several things:

  • Conditions (e.g., chest tightness)
  • Negation triggers (e.g., the word “no”)
  • Historical triggers (e.g., the word “history”)
  • Termination (e.g., the word “but”)
  • Experiencer triggers (e.g., the word “family”)

In this example, chest tightness is negated because the word “no” is a negation trigger.

Figure 5: The ConText Algorithm
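The trigger logic above can be sketched with a toy, ConText-inspired function. This is a deliberately simplified illustration, not the published ConText algorithm: it splits a sentence into clauses at terminators like “but,” then checks for negation, historical, and experiencer triggers that precede the condition within its clause:

```python
NEGATION_TRIGGERS = {"no", "denies", "without"}
HISTORICAL_TRIGGERS = {"history"}
EXPERIENCER_TRIGGERS = {"family", "mother", "sister", "father"}
TERMINATORS = {"but", "however"}

def analyze(sentence, condition):
    """Toy ConText-style check: locate the clause holding the condition
    and look for trigger words earlier in that clause."""
    words = sentence.lower().replace(",", "").split()
    # Split into clauses at terminators so triggers don't leak across "but".
    clauses, current = [], []
    for w in words:
        if w in TERMINATORS:
            clauses.append(current)
            current = []
        else:
            current.append(w)
    clauses.append(current)
    # Simplification: match only the condition's first word.
    target = condition.lower().split()
    for clause in clauses:
        if target[0] in clause:
            prefix = clause[: clause.index(target[0])]
            return {
                "negated": any(w in NEGATION_TRIGGERS for w in prefix),
                "historical": any(w in HISTORICAL_TRIGGERS for w in prefix),
                "other_experiencer": any(w in EXPERIENCER_TRIGGERS for w in prefix),
            }
    return None

sentence = "No history of chest tightness but family history of CHF"
print(analyze(sentence, "chest tightness"))
# {'negated': True, 'historical': True, 'other_experiencer': False}
print(analyze(sentence, "CHF"))
# {'negated': False, 'historical': True, 'other_experiencer': True}
```

Note how the terminator “but” stops the “no” trigger from incorrectly negating CHF in the second clause, which is exactly the behavior the full algorithm formalizes.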

Applying Context and Extracting Values with an NLP Pipeline

NLP pipelines can be run using one of several Java-based, open source tools, such as Apache Unstructured Information Management Architecture (UIMA) and General Architecture for Text Engineering (GATE). After starting a query with a clinical search and expanding that search to include meaningful terms, the next step is passing the search on to an NLP pipeline—in this example, a series of three analyses performed on the text that relate to and build on each other:

  1. Sentence detection.
  2. Entity recognition (e.g., diabetes).
  3. Context algorithm.

Figure 6: The NLP Pipeline

After the third and final step in this NLP pipeline (context algorithm), users are presented with additional filters (Figure 7), in which they can choose their desired context (e.g., “only include affirmed mentions,” “only include recent events,” and “only include mentions in which the patient is the experiencer”). When it comes to text analytics frameworks, flexibility to run algorithms based on a user’s specific question is key. The NLP pipeline should create this flexibility by expanding search capacity to refine results.

Figure 7: Additional Filters

The NLP pipeline should also extract values as demanded by each use case. Many health systems are burdened by regulatory reporting, especially when measures like ejection fraction are not stored as discrete values. In lieu of automated reporting, health systems are obliged to engage team members as “chart abstractors” who manually open thousands of patient charts to find ejection fraction values in clinical notes. An NLP pipeline should not only identify when ejection fraction is documented in a note, but also save each value in such a way that the organization’s analytics platform can use the discrete value in automated reporting. Other common extraction projects include aortic root size, PHQ depression scores, and ankle brachial index.

Figure 8: Extracting Values
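Value extraction of this kind is often a regular-expression job. Below is a minimal sketch assuming a few common ways ejection fraction is phrased in notes; real clinical text varies far more than this single pattern covers:

```python
import re

# Matches phrasings like "EF 55%", "ejection fraction of 35%", "LVEF: 40-45%".
EF_PATTERN = re.compile(
    r"(?:ejection fraction|LVEF|EF)\s*(?:of|is|:)?\s*(\d{1,2})\s*(?:-\s*\d{1,2})?\s*%",
    re.IGNORECASE,
)

def extract_ef(note):
    """Return ejection fraction percentages found in a clinical note."""
    return [int(m.group(1)) for m in EF_PATTERN.finditer(note)]

print(extract_ef("Echo today shows an ejection fraction of 35%."))  # [35]
print(extract_ef("LVEF: 40-45% on prior study; EF 55% today."))     # [40, 55]
```

The extracted integers can then be landed as discrete columns in the analytics platform, so automated reports no longer depend on manual chart abstraction.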

Always Validate the Algorithm

Quantifying algorithm accuracy is critical. Every algorithm needs a validation workflow, which typically includes four steps:

  1. Build studies to review query results.
  2. Assign team members to review results.
  3. Randomly select records to represent the study.
  4. Highlight key words for easy chart review.

When evaluating NLP tools or devising an NLP strategy, health systems should make sure validation is either built into the tool or very easy to do.
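Once reviewers have labeled a random sample of records, algorithm accuracy can be quantified with standard retrieval metrics. A small sketch of computing precision (positive predictive value) and recall (sensitivity) from a chart-review sample, using hypothetical counts:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: share of algorithm hits the reviewers confirmed.
    Recall: share of reviewer-confirmed cases the algorithm found."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Hypothetical chart-review results: 90 confirmed hits, 10 false alarms,
# and 30 confirmed cases the algorithm missed.
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```

Reporting both numbers matters: the PAD example earlier in this report is essentially a recall problem, since code-only queries missed most true cases.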

Focus on Interoperability and Pragmatic Integration

Health systems need to combine text analytics with discrete data, and an enterprise data warehouse (EDW) is a great place to do that. A siloed approach to text analytics won’t work; make sure to integrate validated text analytics with all other analytics within the organization. For example, a health system would put its diabetes and PAD cohorts into the EDW and make that data available system-wide. Health systems need a platform that can pull in data on a regular, timely basis to ensure it’s current.

Figure 9: Text Analytics Interoperability

Text analytics will open a new world of data analysis to health systems. It’s unlikely that anyone can identify the exact data uses and needs for text data in the next two, three, or five years. Instead of trying to make lasting decisions on a data model up front, we recommend focusing on what’s needed to perform timely, relevant analytics: healthcare analytics that quickly adapt to new questions and use cases. This approach follows the Health Catalyst® Late-Binding™ architecture that allows for the flexibility that’s so critical for the evolving world of text analytics.

How to Use Text Analytics in Healthcare to Improve Outcomes

The power of text analytics in healthcare to create precise patient registries is undeniable. Rather than manually curating patient registries and relying solely on coded data, as most health systems are forced to do, health systems should leverage data from coded data sets, NLP, and extraction projects to create precise patient registries.

And rather than limiting the use of text analytics to data scientists, architects, and analysts, health systems should make these tools widely accessible and empower the entire organization to leverage clinical text analytics. Health systems can start using text analytics to improve outcomes today by focusing on four key components:

#1: Optimize Text Search (Display, Medical Terminologies, and Context)

Using search technology for clinical text is an engaging and accessible entry point for text analytics problems. We recommend focusing on the three must-haves of text search (display, medical terminologies, and context) to make it easy for users to search based on their unique use cases.

Understanding the use case is important for displaying the results of the search: some users are looking for specific text notes to review, while others are just looking for aggregated summaries of patients that match the search. Medical terminologies provide a dictionary of relevant terms, synonyms, and logical structures that can enhance clinical text exploration. Context facilitates cohort queries by analyzing text for negation triggers, historical triggers, and other factors.

#2: Enhance Context and Extract Values with an NLP Pipeline

NLP algorithms based on the context surrounding clinical terms can identify when the term is negated (“no evidence of pneumonia”) or applies to another person (“patient’s grandmother had breast cancer”). Regular expressions can be applied to text to identify patterns and extract discrete values, like ejection fraction and ankle brachial index, that are stored in text.

#3: Always Validate the Algorithm

Make sure a strong validation process and tools exist to facilitate expert review of the algorithm, confirming accuracy while building users’ trust in the data.

#4: Focus on Interoperability and Integration Using a Late-Binding Approach

Text analytics should integrate with an EDW using a pragmatic, late-binding approach, in which the information extracted from text can be combined with discrete, coded data.

This broad, all-encompassing approach to text analytics in healthcare will position health systems for clinical and financial success as they strive to create precise patient registries, enhance their understanding of high-risk patient populations, and improve outcomes. Health systems should invest in a solution that not only solves today’s problems, but can also be expanded to solve future use cases.

PowerPoint Slides

Would you like to use or share these concepts? Download this presentation highlighting the key points.

Click Here to Download the Slides
