Detailing the nuances between some of healthcare’s most common (misused) data-related terms
We are all guilty, at one time or another, of using words or phrases interchangeably even though they mean different things. When it comes to healthcare data, understanding these nuances and aligning on definitions makes a big difference in protecting sensitive health information. Sometimes simple words have complex meanings and implications, and they can even vary by industry.
In healthcare, de-identified health data matters. HIPAA protections around patient privacy limit how much identifiable data can be disseminated, and rightfully so, but they also limit the context available when analyzing comprehensive data sets. Methods like tokenization can often be the solution. We’ll explain all this and more as you make your way through this series of blog posts.
Throughout this 3-part blog series, you’ll see an example of our fictitious patient, David Smith Jr.*, who will help illustrate the nuanced differences between, and the interconnectedness of, these common healthcare data terms.
You’ll gain a high-level look at anonymized healthcare data, de-identified healthcare data, tokenized healthcare data, and where expert determination fits into the mix.
In healthcare, what is de-identified data?
According to the article Understanding De-Identified Data, How to Use it in Healthcare, “de-identified data in healthcare is, the process of de-identification, which removes all direct identifiers (e.g. name, social security, etc.) from patient data and allows organizations to share it without the potential of violating the Health Insurance Portability and Accountability Act (HIPAA).”
The de-identification process may also remove or aggregate certain indirect identifiers (e.g. ethnicity, unusual occupation, extreme age, etc.) to keep the re-identification risk below an acceptably low level as defined by HIPAA. These de-identified health data sets do not contain any data that can explicitly and directly identify patients.
As the healthcare ecosystem evolves and the ways in which we leverage data for informed insights grow, we need to make a distinction: just because data is de-identified does not mean that it is anonymized, and most of the time it in fact is not. These terms are not one and the same.
And when it comes to HIPAA, there are two approved de-identification methods:
- The Safe Harbor Method, and
- Expert Determination
With the Safe Harbor Method, 18 is the magic number. It represents the number of specific types of data that MUST be removed from a health record to qualify under the HIPAA “Safe Harbor” De-Identification Method.
Those 18 identifiers are:
- Names
- Dates (except year)
- Telephone numbers
- Geographic data (subdivisions smaller than a state)
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Account numbers
- Health plan beneficiary numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers including license plates
- Web URLs
- Device identifiers and serial numbers
- Internet protocol addresses
- Full face photos and comparable images
- Biometric identifiers
- Any other unique identifying number, characteristic, or code
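As a rough illustration of the Safe Harbor idea, the Python sketch below drops identifier fields from a single record and cuts dates down to the year. The field names and sample values are made up for this example, and it covers only a handful of the identifier types, so it should be read as a sketch rather than a compliant implementation.

```python
from datetime import date

# Hypothetical field names; a real schema will differ.
IDENTIFIER_FIELDS = {
    "name", "phone", "fax", "email", "ssn",
    "medical_record_number", "street_address", "zip_code", "ip_address",
}

# Hypothetical date fields, mapped to the year-only field that replaces them.
DATE_FIELDS = {"date_of_birth": "birth_year", "admission_date": "admission_year"}


def redact_record(record):
    """Return a copy with identifier fields dropped and dates cut to the year."""
    redacted = {}
    for field, value in record.items():
        if field in IDENTIFIER_FIELDS:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS:
            redacted[DATE_FIELDS[field]] = value.year  # keep only the year
        else:
            redacted[field] = value  # non-identifying fields pass through
    return redacted


# Made-up values for the fictitious patient.
record = {
    "name": "David Smith Jr.",
    "date_of_birth": date(1980, 5, 17),
    "zip_code": "02139",
    "phone": "555-0100",
    "diagnosis": "depression",
}

print(redact_record(record))
# {'birth_year': 1980, 'diagnosis': 'depression'}
```

The original record is left untouched, so the identifiable version can stay behind whatever access controls already protect it.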
For example, let’s take our fictitious patient David Smith Jr.*, who has been diagnosed with depression. In this instance, a life science company may want to de-identify David’s patient data to find insights into other depression patients by creating a cohort of others that “look” like him. Because the identifying information is stripped away, the risk of identifying David Smith Jr. is minimized. Creating a cohort of patients who are like David Smith Jr. can help identify patients for future clinical trials or research studies.
Expert determination is a methodology that is becoming more widely used in healthcare data. It requires that an independent expert certify that the re-identification risk inherent in the data is minimal, which enables datasets to be HIPAA compliant. As noted, expert determination is one of the two approved ways to de-identify data under HIPAA. The other is Safe Harbor.
The four principles used by these experts to determine if a dataset is sufficiently de-identified are:
- Replicability: How likely is it that the same value will consistently occur with reference to the individual, i.e. how stable is the data value? A patient’s blood pressure would have low replicability since it can vary over time or data sources, but an adult’s height will tend to be highly replicable since it will be generally stable over time and across data sources.
- Data source availability: How widely available is the data? A patient’s date of death could be publicly available through obituaries or government death records, but it would not be common for a patient’s diagnosis of Type II Diabetes to be available in public data sources.
- Distinguishability: How likely is it that the patient can be uniquely described by the data elements? Knowing a patient’s 5-digit zip code, date of birth, and gender is enough to uniquely identify many individuals, but if we generalize that information to 3-digit zip code, year of birth, and gender, then the data is a lot less distinguishable and, in most cases, we are not able to uniquely identify the individual.
- Risk assessment: What is the risk of re-identifying the patient? The above three factors contribute to this calculation. Demographic variables such as zip code or date of birth tend to be replicable and distinguishable in addition to being available in public data sources, so the risk of re-identification is high. Other attributes in the data such as blood pressure or diagnosis tend to be low in one or more of the above factors, which means that inclusion of these data points results in less risk of re-identification.
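The distinguishability principle can be made concrete with a small Python sketch. It counts how many records are unique on zip code, date of birth, and gender, then repeats the count after generalizing to 3-digit zip code and year of birth. The five records are invented for illustration, and counting unique combinations is only one simple proxy for risk, not the method an expert would necessarily use.

```python
from collections import Counter

# Made-up records: (5-digit zip code, date of birth, gender)
patients = [
    ("02139", "1980-05-17", "M"),
    ("02139", "1980-11-02", "M"),
    ("02141", "1980-03-09", "M"),
    ("02139", "1975-07-21", "F"),
    ("02142", "1975-01-30", "F"),
]


def generalize(record):
    """Coarsen a record to (3-digit zip code, year of birth, gender)."""
    zip_code, date_of_birth, gender = record
    return (zip_code[:3], date_of_birth[:4], gender)


def unique_count(records):
    """Count records whose combination of values appears exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1)


print(unique_count(patients))                           # 5
print(unique_count([generalize(p) for p in patients]))  # 0
```

In this toy data every patient is unique before generalization and none is afterward, which is the effect the distinguishability bullet describes.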
De-identified data does not include any direct identifier but can still contain indirect identifiers (e.g., gender, race, age) and non-identifiers, which allow researchers to study health data when:
- developing predictive analytics,
- addressing healthcare gaps within a patient journey,
- advancing medical research and treatment,
- and leveraging artificial intelligence initiatives.
Safe Harbor vs. Expert Determination
We’ve described the two methods of de-identification deemed appropriate in HIPAA-regulated settings, but let’s break them down further. While both methods sufficiently secure and safeguard patient information, each provides its own value when analyzing patient health outcomes. Now you may be wondering: when and why choose expert determination?
Why Opt for Expert Determination?
Expert determination is often the preferred de-identification method when data will be used for research purposes, particularly when the cohort of interest has a small sample size. The goal of de-identifying data is to retain robust or impactful variables while statistically mitigating the risk of re-identification.
Safe Harbor can be thought of as a big black marker that redacts personally identifiable information, whereas expert determination takes a more tactical approach. It allows the analyst to keep the variables that need more detail and to categorize or blur the components that do not require that detail to make an informed inference or hypothesis.
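To show what "categorize or blur" might look like in practice, here is a minimal Python sketch that keeps a detailed variable (the month of diagnosis) while coarsening age into 10-year bands and zip code into its first three digits. The field names, band width, and the choice to group ages 90 and over are assumptions for this example; in a real project the expert would decide which variables to blur and by how much.

```python
def age_band(age, width=10, top_code=90):
    """Map an exact age to a band such as '40-49'; group the oldest ages."""
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def blur_record(record):
    """Return a copy with age and zip code coarsened; other fields kept."""
    blurred = dict(record)  # copy so the input is not modified
    blurred["age_band"] = age_band(blurred.pop("age"))
    blurred["zip3"] = blurred.pop("zip_code")[:3]
    return blurred


# Made-up values for illustration.
record = {
    "age": 43,
    "zip_code": "02139",
    "diagnosis": "depression",
    "diagnosis_month": "2023-04",
}

print(blur_record(record))
# {'diagnosis': 'depression', 'diagnosis_month': '2023-04',
#  'age_band': '40-49', 'zip3': '021'}
```

The month of diagnosis survives here, which is the kind of detail a blanket redaction of dates would have removed.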
In short, expert determination puts robust context behind the data, making it more suitable for meeting unique business and research needs. If data is de-identified under Safe Harbor, all date elements more specific than the year are stripped out, making it a good choice mainly for annual reports. There’s a time and place for both, so working with an expert to understand your goals can put you on the right track.
In our next post, we’ll dive into how anonymized data in healthcare fits into the mix and continue to track David’s journey.
*Please note, all names, diagnoses, and other information included in this infographic are fictitious. They are included for illustrative purposes and do not identify any actual persons (living or deceased).