How do you handle natural language processing (NLP) of unstructured data?

Response: Health Catalyst is currently developing an NLP subsystem that is intended to become part of the standard Health Catalyst Late-Binding™ Data Warehouse platform. We refer to the project as Pragmatic Natural Language Processing, or pNLP. As the name suggests, we are not attempting to solve the general NLP problem for healthcare, which remains largely unsolved. Instead, we are focused on a pragmatic subset that will enable the extraction of additional data currently trapped in unstructured text blocks.

Our approach starts with basic text processing: regular expressions, word tokenization, sentence segmentation, word normalization and stemming, and Levenshtein edit distance, applied to extract information of value from physician notes and other clinical documents. Examples of potential targets include medications (frequently misspelled), diagnoses, and common codes (e.g., ICD-9, ICD-10, RxNorm, NDC).

As a starting point, we are leveraging common, existing technologies that support Approximate Regular Expressions (AREs). AREs use Levenshtein edit distance to count the insertions, deletions, and substitutions needed to turn one string into another, so a pattern can match text that is within a small number of edits of the target. These techniques help manage misspellings and other problems common to free text. agrep and TRE are representative technologies that perform these functions very well.

Negation is another feature we will support. Studies show that 50% of concepts in clinical reports are negated, so support for negation is essential. NegEx and ConText are two technologies we are exploring as part of our solution to deal with common negation terms as well as pseudo-negation terms.
Examples of potential negations include: “patient denies chest pain,” “no indication of congestive heart failure,” and “no shortness-of-breath.” The aforementioned tools address many of these negation cases.

Regular expressions with negation support are of fairly low value on their own. Therefore, we are adding support for algorithms: sequences of regular expressions that form a pipeline operating on sentences and larger text blocks. Algorithms allow one to link a number of regular expressions together (essentially as subroutines) in useful combinations, and they facilitate reuse of regular expression components.

Concept discovery and indexing, leveraging technologies like MetaMap from the NIH and KnowledgeMap, allow us to do more general data mining over the results generated by the regular expression pipelines. These techniques are typically called Named Entity Recognition techniques.

Lastly, we plan to include support for common lexicons. ICD, SNOMED CT, UMLS, and others will be leveraged for concept searches. We are exploring the inclusion of other UMLS lexicons that are typically licensed as optional extensions to the lexicon sets.

In summary, we are leveraging two complementary approaches, Approximate Regular Expressions and Concept Discovery & Indexing, to create a pragmatic NLP solution that will be fully integrated into the Health Catalyst solution in a future release.
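
To make the approach above more concrete, the following is a minimal Python sketch of how sentence segmentation, edit-distance matching, and a simple negation check might be chained into a pipeline. It is an illustration only, not the pNLP implementation: the term list, the trigger list, and the names `levenshtein`, `tokenize`, and `find_concepts` are all hypothetical, and the negation check is a much-simplified stand-in for what tools such as NegEx or ConText do.

```python
import re

# Hypothetical mini-lexicon and negation triggers, for illustration only.
# A real system would draw these from lexicons such as RxNorm or SNOMED CT.
TERMS = ["metoprolol", "lisinopril", "chest pain"]
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def tokenize(sentence: str) -> list:
    """Lowercase the sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def find_concepts(text: str, max_edits: int = 1, window: int = 5) -> list:
    """Return (term, matched_text, negated) tuples found in the text.

    A term matches when some run of tokens is within `max_edits` edits of it.
    A match is flagged as negated when a trigger word appears among the
    `window` tokens that precede it in the same sentence (assumed heuristic).
    """
    results = []
    # Naive sentence segmentation: split after ., !, or ? followed by space.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = tokenize(sentence)
        for term in TERMS:
            width = len(term.split())
            for start in range(len(tokens) - width + 1):
                candidate = " ".join(tokens[start:start + width])
                if levenshtein(candidate, term) <= max_edits:
                    preceding = tokens[max(0, start - window):start]
                    negated = any(t in NEGATION_TRIGGERS for t in preceding)
                    results.append((term, candidate, negated))
    return results


if __name__ == "__main__":
    note = "Patient denies chest pain. Started on metoprlol 25 mg daily."
    for term, matched, negated in find_concepts(note):
        print(term, matched, "negated" if negated else "affirmed")
```

In this sketch the misspelled "metoprlol" is one deletion away from "metoprolol" and so is still recognized, while "chest pain" is flagged as negated because "denies" precedes it in the same sentence. A production pipeline would need larger lexicons, tuned edit thresholds, and handling of pseudo-negation phrases, none of which are attempted here.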