Data Collection Tools in Healthcare: What You Need

My Folder

Data Collection Tools in Healthcare Enterprise Data Warehouses (EDWs) are gaining widespread popularity in healthcare because they are designed to make data collection in healthcare possible and easier to analyze by aggregating data from multiple sources (source systems) into a single, integrated data repository.

But even the best and most comprehensive data warehouses may be missing some key data. Why is this? Quite simply, the data is not being collected or it is not being collected in analytic format. For example, there may be no field in your EMR for capturing a newly discovered genetic marker, or a lab result might be documented in a text field rather than in a discrete one.

Sometimes, the ideal solution to this problem is to get a change made in the EMR so that this data can be captured in a discrete field. However, in the real world, getting this change pushed through in the source system could take weeks, months, or (unfortunately) years.

Other times, it may not even be feasible to collect the data in an EMR. Perhaps the data is related to a research study, and it doesn’t make sense to ask busy doctors and nurses to enter the data in their already crowded workflows.

The usual workaround for this problem is to enter the metric into a one-off Access database or an Excel spreadsheet (affectionately known as a “spreadmart”). This solution seems practical, but it is also highly problematic and one of the reasons data architects often prefix spreadmart with other colorful vernacular.

Problems with One-off Solutions to Data Collection in Healthcare

There are three significant problems that these one-off databases and spreadmarts create.

Problem #1: Data Quality

Spreadmarts and one-off databases create real data quality problems, particularly when they are maintained separate from the source systems for a long period of time. To illustrate how data quality can suffer, I’ll use the example of medical record numbers (MRNs). At some point, in order to perform analyses using the data in the spreadmart in conjunction with other data points in the EDW, you’ll have to marry that data back to data in the EDW. Matching MRNs is a foundational aspect of this process. But what if you mistyped some of the MRNs when entering them into your spreadsheet? Or what if some of the MRNs have changed in the source system over time?

This is a problem I encountered while managing the EDW at a university health system. One of our research teams was using a spreadmart to track cancer tumor markers. They would pull the markers from text-heavy pathology reports and enter them as discrete values in an Excel spreadsheet.

After maintaining this “database” for several years, they approached me and asked me to upload their data into the EDW so we could run an outcomes study of the various tumor markers. That’s when things fell apart. We were unable to match a significant percentage of the patients in the spreadmart to patients in the EDW—all because of the MRN data quality issue. In the long run, we were able to match the patients, but it took a lot of manual effort.

Problem #2: Collaboration

Any self-created data collection tool like an Excel spreadsheet becomes yet another data silo for your organization to deal with. If you’ve implemented an EDW, you did so with the intention of eliminating data siloes, so these new siloes are at cross purposes to your enterprise-wide strategy.

Collaborating on any data silo is a challenge. One of the messiest problems you’ll encounter is that of versioning. Consider this example. Let’s say you have two people who need to enter data into the spreadsheet. They both have time in the evening to enter that data. So you email the spreadsheet to them—a potential security breach, but we’ll cover that topic in a moment—and ask them to enter the latest records. They both start writing data into their own versions. They notice a few mistakes in previous entries and correct them. They send their versions back to you, and you don’t know what records have been changed or how to merge those all back into a single source of truth.

One way to solve this problem is to lock the document so that only one person can edit it at a time. But this approach doesn’t fit everyone’s schedule (imagine carving out one free hour in your day for data entry only to find that the spreadsheet is locked for editing). Single-threading data entry rarely results in the most effective use of time.

Problem #3: Security

Data security is a huge concern for healthcare organizations. The federal government is holding organizations accountable for HIPAA breaches by levying significant fines. Just this past April an organization paid a $1,725,220 settlement because of unencrypted data on a single stolen laptop.

The sad truth is that spreadmarts and other one-off data collection tools are often stored on laptops that may not be using the latest security procedures. Even if the spreadsheets are stored on encrypted laptops, collaborating on those spreadsheets tends to introduce security problems. People might email the spreadsheet back and forth. Or they might transfer the file from one encrypted laptop to another via an unencrypted thumb drive. They might even upload it to a shared, unencrypted drive on Google or Dropbox (yes, this does happen). The possibilities are endless, quite common, and rather frightening.

A Great Data Collection Tool Solution: Instant Data Entry Application (IDEA)

A data collection tool for healthcare analytics that solves all of these problems is the Instant Data Entry Application (IDEA). I typically try to avoid blogging about Health Catalyst products, but the truth is that this tool is unique in the way that it addresses big, real-world problems that I’ve faced throughout my career in healthcare analytics.

IDEA is a simple tool than runs on the EDW and lets you design any number of web-based data collection forms. It allows you to supplement data collection when data required for improvement initiatives or research is not collected by your existing source systems.

By using IDEA, you avoid the three problems I described above. Briefly, here’s how:

  1. Data quality. IDEA pulls master data from the EDW. So when you enter data on a patient, the only data you enter is the new data that isn’t available in another source system. There is no opportunity for typos or other manual data entry errors. You don’t type in the MRN; you select from a list of existing patients. You don’t type in an encounter; you select from a list of existing encounters. Since every piece of data you enter can be proactively linked to a record in the EDW, it is trivial to marry data from IDEA to other data in the EDW. Automatic data flows to copy IDEA data into the EDW make this even easier. Additionally, IDEA allows the user to put meaningful constraints on data (you can’t enter a year that doesn’t exist or enter letters where only numbers would be appropriate).
  2. Collaboration. IDEA is a centralized server application that enables you to control who can see and who can edit each data collection form (or application). Those people who have access can simultaneously edit data within IDEA without stepping on each other’s toes. This eliminates any versioning problem.
  3. Security. IDEA gives you fine-grained control over who can access the data within each data collection form. It sits on a secured server behind the firewall. When people look at data, they don’t need to download it to their devices. Importantly, all security challenges associated with collaboration are eliminated.

Every day, healthcare professionals are collecting data in Excel spreadsheets and one-off databases. These people are collecting very valuable data. They simply aren’t using the best methodology. IDEA offers an ideal data collection methodology in an environment where analysis, collaboration, and data security are more important than ever.

What are your experiences with Access or Excel “databases”?

Loading next article...