Healthcare NLP: The Secret to Unstructured Data’s Full Potential

Article Summary


While healthcare data is an ever-growing resource, thanks to broader EHR adoption and new sources (e.g., patient-generated data), many health systems aren’t currently leveraging this information cache to its full potential. Analysts can’t extract and analyze a significant portion of healthcare data (e.g., follow-up appointments, vitals, charges, orders, encounters, and symptoms) because it exists in unstructured (text) form, which is bigger and more complex than structured data.

Natural language processing (NLP) taps into the potential of unstructured data by using artificial intelligence (AI) to extract and analyze meaningful insights from the estimated 80 percent of health data that exists in text form. Though still an evolving capability, NLP is showing promise in helping organizations get more from their data.

This report is based on a 2018 webinar given by Wendy Chapman, PhD, Chair, Department of Biomedical Informatics, University of Utah School of Medicine, and Mike Dow, Technical Director, Health Catalyst, entitled “Tapping Into the Potential of Natural Language Processing in Healthcare.”

As EHR adoption increases and health data abounds, healthcare organizations have more analytics-driven opportunities to improve healthcare delivery and outcomes. Health systems, however, are having difficulty using all the available data to its fullest potential. Text (unstructured) data is a particular challenge, as it’s bigger, more complex, and has more sources and storage locations than structured data. But with the increasing use of natural language processing (NLP), organizations are growing their ability to get more actionable insights from healthcare data.

NLP leverages artificial intelligence (AI) to help analytics systems understand and work with unstructured data. The capability promises to extract useful data from EHRs and even give EHRs speech-recognition capabilities during clinician visits. EHRs, however, are currently frustrating clinicians, as they take time away from patient engagement and improving patient care. An American Medical Association poll of physicians showed that around half of all respondents were dissatisfied with their EHR’s ability to improve costs, efficiency, and productivity.

This article describes the potential of NLP to maximize the value of the EHR and healthcare data, making data a critical and trusted component in improving health outcomes.

The Promise of Healthcare NLP to Improve Outcomes

NLP processes unstructured data from different sources (e.g., EMRs, literature, and social media) so that analytics systems can interpret it (Figure 1). Once NLP converts the text to structured data, health systems can use it to classify patients, extract insights, and summarize information.

Diagram of NLP classifying, extracting, and summarizing unstructured text
Figure 1: NLP classifies, extracts, and summarizes unstructured text
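To make this concrete, the sketch below shows one simple way unstructured text can become structured rows: scan note sentences for explicit symptom phrases and record each mention with a crude negation check. It is an illustrative Python example, not the approach described in the webinar; the term list and negation cues are assumptions, and real systems use clinical vocabularies and far richer context handling.

    import re
    from dataclasses import dataclass

    @dataclass
    class ExtractedFinding:
        """A structured record derived from free-text clinical narrative."""
        concept: str
        negated: bool
        snippet: str

    # Illustrative terms only; a production system would use a clinical vocabulary.
    TARGET_CONCEPTS = {"chest pain": "chest pain", "shortness of breath": "dyspnea"}
    NEGATION_CUES = ("no ", "denies ", "without ")

    def extract_findings(note_text: str) -> list[ExtractedFinding]:
        """Turn explicit symptom mentions in a note into structured rows."""
        findings = []
        for sentence in re.split(r"(?<=[.!?])\s+", note_text.lower()):
            for phrase, concept in TARGET_CONCEPTS.items():
                if phrase in sentence:
                    negated = any(cue in sentence for cue in NEGATION_CUES)
                    findings.append(ExtractedFinding(concept, negated, sentence.strip()))
        return findings

    print(extract_findings("Patient denies chest pain. Reports shortness of breath on exertion."))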

Four areas in which healthcare NLP can improve function—and, ultimately, care—include EHR usability, predictive analytics, phenotyping, and quality improvement:

NLP Improves EHR Data Usability

The typical EHR arranges information by patient encounter, making it difficult to find critical patient information (e.g., social history—a strong predictor of readmissions). NLP can enable an EHR interface that makes patient encounter information easier for clinicians to find.

The interface is organized into sections and highlights words associated with concerns patients described during encounters; selecting a word populates the rest of the page with related information. For example, all mentions of fatigue would show on a timeline at the top of the page, and the notes mentioning the word would show in a box at the bottom of the page. The interface makes it easier for clinicians to find buried data and make diagnoses they might have otherwise missed.
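As a rough illustration of that kind of interface logic, the sketch below groups hypothetical dated notes by a keyword and returns them in timeline order. The note structure and keyword are assumptions for illustration, not details of the interface described above.

    from datetime import date

    # Hypothetical encounter notes: (encounter_date, note_text) pairs.
    notes = [
        (date(2018, 1, 5), "Patient reports worsening fatigue and poor sleep."),
        (date(2018, 3, 12), "Fatigue improved after medication adjustment."),
        (date(2018, 6, 2), "Follow-up for hypertension; no new complaints."),
    ]

    def keyword_timeline(notes, keyword):
        """Return dated snippets mentioning the keyword, oldest first,
        roughly what an NLP-backed interface might show on a timeline."""
        hits = [(d, text) for d, text in notes if keyword.lower() in text.lower()]
        return sorted(hits)

    for encounter_date, snippet in keyword_timeline(notes, "fatigue"):
        print(encounter_date, "-", snippet)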

NLP Enables Predictive Analytics

One of the more exciting benefits of NLP is its ability to enable predictive analytics that address significant population health concerns. For example, according to recent reports, suicide has been rising in the United States, and healthcare professionals are working to understand who is at risk so they can intervene. A 2018 study used NLP to predict suicide attempts by monitoring social media. Results showed clear indicators of imminent attempts: before attempting suicide, Twitter users posted fewer emojis, limited emojis to certain types (e.g., blue or broken heart symbols), or posted more angry or sad tweets (Figure 2). The system had a 70 percent prediction rate with only a 10 percent false positive rate.

Graph showing use of NLP to recognize suicide risk in emoji use
Figure 2: Using NLP to recognize suicide risk in emoji use
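As a rough sketch of the kinds of signals the study describes, the example below turns a single post into a handful of numeric features (emoji counts and negative language). The word and emoji lists are illustrative assumptions, not the study’s actual features or model.

    # Illustrative feature extraction -- not the study's pipeline or classifier.
    SAD_EMOJIS = ("💔", "💙", "😢", "😭")
    NEGATIVE_PHRASES = ("alone", "hopeless", "hate", "tired of")

    def post_features(post: str) -> dict:
        """Turn a single social media post into numeric risk-related features."""
        text = post.lower()
        return {
            "tracked_emoji_count": sum(post.count(e) for e in SAD_EMOJIS),
            "heart_emoji_count": post.count("💔") + post.count("💙"),
            "negative_phrase_count": sum(text.count(p) for p in NEGATIVE_PHRASES),
            "word_count": len(post.split()),
        }

    # Features like these could feed a standard classifier trained on labeled
    # posts to estimate risk over time.
    print(post_features("so tired of everything 💔"))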

NLP Boosts Phenotyping Capabilities

A phenotype is an observable physical or biochemical expression of a specific trait in an organism. These traits may relate to appearance, biochemical processes, or behavior. Phenotyping helps clinicians group or categorize patients, providing a deeper, more focused look into data (e.g., listing patients who share certain traits) and the ability to compare patient cohorts. Currently, most analysts and clinicians use structured data for phenotyping because it’s easy to extract for analysis. NLP gives analysts a tool to extract and analyze unstructured data (e.g., follow-up appointments, vitals, charges, orders, encounters, and symptoms), which some experts estimate makes up 80 percent of all available patient data. Access to unstructured data makes far more information available for creating phenotypes for patient groups.

NLP also allows for richer phenotypes. For example, pathology reports contain a lot of information, such as a patient’s condition, location of a growth, stage of a cancer, procedure(s), medications, and genetic status. While traditional analytics can’t access that data from pathology reports, NLP empowers analysts to extract this type of data to answer complex, specific questions (e.g., cancerous tissue types associated with certain genetic mutations).
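As a simple illustration, the sketch below pulls a few structured fields out of a hypothetical pathology report with regular expressions. The report text and patterns are assumptions for illustration; production clinical NLP relies on much richer models and vocabularies.

    import re

    # Hypothetical pathology report text; patterns are simplified illustrations.
    report = (
        "Specimen: left breast, core biopsy. Diagnosis: invasive ductal carcinoma, "
        "grade 2, stage IIA. BRCA1 mutation detected."
    )

    def extract_pathology_fields(text: str) -> dict:
        """Pull a few structured fields out of a free-text pathology report."""
        stage = re.search(r"stage\s+([0IV]+[AB]?)", text, re.IGNORECASE)
        site = re.search(r"specimen:\s*([^.,]+)", text, re.IGNORECASE)
        mutation = re.search(r"(\bBRCA[12]\b)\s+mutation", text, re.IGNORECASE)
        return {
            "site": site.group(1).strip() if site else None,
            "stage": stage.group(1) if stage else None,
            "mutation": mutation.group(1) if mutation else None,
        }

    print(extract_pathology_fields(report))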

NLP Enables Health System Quality Improvement

The federal government and associated agencies require all hospitals to report certain outcome measures. One required measure is adenoma detection rate (ADR), which is the rate at which doctors find adenomas during a colonoscopy. The current process for reporting is to pay someone to analyze a small sampling of patient charts, read through the pathology reports, and calculate the ADR. NLP automates and accelerates this process, increasing the sample size of patient charts and allowing real-time analysis.
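A minimal sketch of that automation, assuming pathology reports are available as text and ADR is computed as the share of screening colonoscopies with at least one adenoma found; the terms and field names are illustrative assumptions.

    # Flag pathology reports that mention an adenoma, then compute the rate.
    ADENOMA_TERMS = ("adenoma", "adenomatous polyp")

    def has_adenoma(pathology_text: str) -> bool:
        text = pathology_text.lower()
        return any(term in text for term in ADENOMA_TERMS)

    def adenoma_detection_rate(colonoscopies: list[dict]) -> float:
        """ADR = colonoscopies with at least one adenoma / total colonoscopies."""
        if not colonoscopies:
            return 0.0
        detected = sum(1 for c in colonoscopies if has_adenoma(c["pathology_report"]))
        return detected / len(colonoscopies)

    cases = [
        {"pathology_report": "Tubular adenoma, 5 mm, sigmoid colon."},
        {"pathology_report": "Hyperplastic polyp. No dysplasia."},
    ]
    print(f"ADR: {adenoma_detection_rate(cases):.0%}")  # 50% for this toy sample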

A clinician has developed a report card that uses NLP to automatically calculate ADR. Studies show that when physicians can see quantifiable results of their performance, they tend to change their behavior; in this case, physicians who received feedback about their ADR changed their practice to improve their detection rates. This is important because for every 1 percent increase in ADR, there is a 3 percent decrease in colon cancer mortality.

While the four areas in which NLP enhances the value of healthcare data show significant promise, NLP still has a long way to go before it reaches widespread adoption and a large-scale impact on outcomes improvement.

Challenges and Limitations of NLP

Most health systems have not yet begun using NLP to its full potential. This is likely because implementing NLP successfully comes with significant challenges:

Garbage In, Garbage Out

The old saying “garbage in, garbage out” applies to NLP. Good, usable data can only be extracted if the data is easy to identify. When digging data out of EHRs, analysts often find a problem with the way it was entered: clinicians typing information under time pressure tend to rely on shortcuts and templates. NLP looks for sentences, not templates, making data within templates difficult to handle. Cut-and-pasted text presents another challenge; this shortcut propagates more patient data than is relevant (note bloat), as well as outdated or inaccurate information, throughout health records, making clinician notes less useful.
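One simple way to surface possible note bloat is to compare consecutive notes for near-duplicate text. The sketch below uses a text-similarity ratio for that purpose; it is an illustration of the idea, not a method described in the webinar.

    from difflib import SequenceMatcher

    def copy_paste_score(previous_note: str, current_note: str) -> float:
        """Similarity ratio between consecutive notes; values near 1.0 suggest
        large blocks of text were carried forward (possible note bloat)."""
        return SequenceMatcher(None, previous_note, current_note).ratio()

    day1 = "Assessment: CHF exacerbation. Plan: diurese, daily weights, low-sodium diet."
    day2 = "Assessment: CHF exacerbation. Plan: diurese, daily weights, low-sodium diet. Stable."
    print(round(copy_paste_score(day1, day2), 2))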

Modeling for Meaning Can Be Challenging

NLP runs on text—a series of words strung together. NLP systems need to extract meaning from text and infer context, which is not easy to do. If developers don’t model NLP systems well to find meaning from the start, the systems won’t scale well.

NLP Works on Specific Sublanguages

Sublanguage, a subset of natural language, is another challenge for NLP. Medical language is a sublanguage with its own vocabulary and rules that differ from those of the general language. To extract meaning from a sublanguage, NLP systems must understand the rules of that language. Social media, for example, is a sublanguage: it uses abbreviations and emoticons to express meaning (versus using words for the same concepts). With these differences, analysts cannot run an NLP system trained on newspaper text on social media and expect it to extract the meaning.

Medical language has different sublanguages within it. For example, medical blogs and clinical notes use different language. Because of these differences, health systems should not purchase an off-the-shelf NLP system built for one sublanguage and use it on another. Developers and analysts have to tailor NLP systems for a specific language (e.g., healthcare), and that tailoring process takes time.

NLP Doesn’t Yet Distinguish Linguistic Variation

With linguistic variation, there are many ways to say the same thing (e.g., derivation, in which different forms of a word carry similar meaning, and synonymy, in which different words express one concept). Current NLP systems don’t yet reliably recognize these variations as equivalent.
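As a small illustration of the problem, the sketch below collapses a handful of synonyms and derivational variants to canonical concepts with a hand-built map. The mappings are assumptions for illustration; real systems lean on standard terminologies such as UMLS rather than hand-built tables.

    # A toy normalization table: many surface forms map to one canonical concept.
    SYNONYMS = {
        "heart attack": "myocardial infarction",
        "mi": "myocardial infarction",
        "myocardial infarct": "myocardial infarction",
        "high blood pressure": "hypertension",
        "htn": "hypertension",
    }

    def normalize(term: str) -> str:
        """Collapse synonyms and derivational variants to a canonical concept."""
        return SYNONYMS.get(term.lower().strip(), term.lower().strip())

    print(normalize("Heart attack"), "|", normalize("HTN"))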

How Healthcare Organizations Can Use NLP Now

Even though NLP has challenges to resolve, health systems can still benefit while the capability evolves, starting with more attainable goals (the low-hanging fruit) and moving to more complex applications (the high-hanging fruit).

The Low-Hanging Fruit: Easy for NLP to Identify and Process

There are certain areas where current NLP is already effective:

  • Explicit mentions: when NLP is looking for chest pain, it will process the phrase “chest pain.”
  • Unambiguous vocabulary: the words have one meaning, no matter where they appear.

Successful NLP application in the low-hanging fruit category is a reality within some areas of healthcare, including predictive analytics and quality improvement. For example, a study assessed using NLP to process radiology reports to look for pulmonary embolism (PE) and postoperative venous thromboembolism (VTE). It found that NLP and unstructured data captured 50 percent more cases than structured data alone would identify.
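A minimal sketch in that spirit pairs explicit-mention matching with a crude negation check over radiology report sentences. The terms and cues are illustrative assumptions, not those used in the study.

    TARGET_TERMS = ("pulmonary embolism", "pulmonary embolus", "venous thromboembolism")
    NEGATION_CUES = ("no evidence of", "negative for", "without")

    def flag_report(report_text: str) -> bool:
        """Return True if a radiology report affirms one of the target findings."""
        for sentence in report_text.lower().split("."):
            if any(term in sentence for term in TARGET_TERMS):
                if not any(cue in sentence for cue in NEGATION_CUES):
                    return True
        return False

    print(flag_report("CT angiogram: acute pulmonary embolism in the right lower lobe."))  # True
    print(flag_report("No evidence of pulmonary embolism."))                               # False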

NLP also plays a current role in decision support. For example, NLP can flag patients from an EHR who have a first- or second-degree relative with a history of breast or colorectal cancer diagnosed before the age of 45. Once NLP systems flag those patients, a patient portal sends an email alerting flagged patients of their family history and increased risk of these cancers and recommends preventive measures.
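As an illustrative sketch of that kind of rule, the example below flags notes where a listed relative had breast or colorectal cancer diagnosed before age 45. The regular expressions and note text are assumptions, not the production logic behind the portal alerts.

    import re

    RELATIVES = r"(mother|father|sister|brother|grandmother|grandfather)"
    CANCERS = r"(breast|colorectal|colon)\s+cancer"

    def flag_family_history(note: str, age_cutoff: int = 45) -> bool:
        """Flag notes where a first- or second-degree relative had breast or
        colorectal cancer diagnosed before the age cutoff."""
        pattern = rf"{RELATIVES}.*?{CANCERS}.*?age\s+(\d+)"
        match = re.search(pattern, note, re.IGNORECASE)
        return bool(match) and int(match.group(3)) < age_cutoff

    note = "Family history: mother diagnosed with breast cancer at age 42."
    print(flag_family_history(note))  # True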

Evolving Towards the High-Hanging Fruit

As NLP evolves and developers meet the current challenges of NLP, health system analysts will more easily access the high-hanging fruit. Getting to the high-hanging fruit requires more advanced capabilities:

  • Inference: If a clinician wants to know if a patient has social support but the phrase “has social support” isn’t in the EHR, next-generation NLP will be able to infer meaning from the context. Instead of relying on the exact phrase of “has social support,” the system will process a phrase like “brother at bedside” and know it means the patient has social support.
  • Ambiguous vocabulary: If an analyst programs the NLP system to look for the phrase “brother at bedside” and it encounters “stood at bedside” or “brother died of heart attack,” it will know the meaning is not the same, despite at least one of the words being the same. With current NLP systems, these two phrases would come back as false positives, treated as if they meant the same as “brother at bedside.”
  • Semantic roles: Current NLP systems struggle with semantic roles. The phrase “wife helps patient with meds” means something very different from “patient helps wife with meds,” yet a keyword-based NLP system cannot differentiate between the two, which is a current limitation of NLP. In the future, NLP systems could be built to understand semantic roles (e.g., who is the subject and who is the object), as shown in the sketch after this list.
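As a sketch of how semantic roles could be recovered, the example below uses dependency parsing, here with spaCy and its small English model (which must be installed separately with pip install spacy and python -m spacy download en_core_web_sm), to identify who helps whom. The outputs noted in the comments are what the model would typically produce, not guaranteed results.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def helper_and_helped(sentence: str):
        """Return (subject, object) of the sentence, i.e., who helps whom."""
        doc = nlp(sentence)
        subject = next((t.text for t in doc if t.dep_ == "nsubj"), None)
        obj = next((t.text for t in doc if t.dep_ in ("dobj", "obj")), None)
        return subject, obj

    print(helper_and_helped("wife helps patient with meds"))   # typically ('wife', 'patient')
    print(helper_and_helped("patient helps wife with meds"))   # typically ('patient', 'wife')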

NLP’s Promise to Get More from Healthcare Data

Despite NLP’s challenges, the healthcare industry is beginning to embrace its potential to get critical insights from the wide variety of health data. Healthcare organizations are already using NLP to get at the low-hanging fruit, and major tech entities are leveraging NLP in health-related tools; Amazon, for example, recently released a user-friendly clinical NLP tool.

Many open-source tools are available at no cost—allowing users to do classification, find phrases, and look for contextual information that provides clues about family history. But to maximize NLP’s potential in healthcare, organizations need to look beyond these off-the-shelf solutions to healthcare-specific vendor systems that integrate into existing workflows. This strategic approach will fully leverage NLP to improve healthcare outcomes.

Additional Reading

Would you like to learn more about this topic? Here are some articles we suggest:

  1. Healthcare NLP: Four Essentials to Make the Most of Unstructured Data
  2. How Healthcare Text Analytics and Machine Learning Work Together to Improve Patient Outcomes
  3. Four Steps to Effective Opportunity Analysis
