What Does It Mean to “De-identify” Patient Data?

Jamie Blackport
Publish Date: April 19, 2022

First in a three-part series: “Patient privacy and healthcare data exchange: What privacy and compliance officers need to know to de-identify patient data and stay HIPAA compliant.”

Privacy and compliance officers across the healthcare ecosystem today are part of an exciting new landscape of connecting data to improve patient outcomes. Sharing patient data can improve healthcare continuity, accelerate clinical research and help estimate and manage patient health risk factors.

One major challenge is ensuring that patient data sharing with third parties is done in a manner that protects patient privacy at every step.

In this three-part series, we tackle several important questions facing privacy and compliance officers. We start with how and when an organization’s data is considered “de-identified” under the Health Insurance Portability and Accountability Act of 1996 (HIPAA).

What does it mean to “de-identify” patient data?

One way to protect patient privacy is to remove identifying information from the health data in question. This can include removing the patient’s name, address, date of birth and any other information that could enable their identity to become known.

A first step is to tokenize patient data: removing or modifying personally identifiable information (PII) and creating anonymized and encrypted records that can be aggregated for research and analytics purposes. A strong tokenization scheme must ensure that neither reverse engineering the tokens nor correlating them with other available identifying information risks re-identifying any given patient.
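To make the idea concrete, here is a minimal sketch of one way a token could be derived from PII, using a keyed one-way hash (HMAC-SHA256) from Python's standard library. This is an illustration only, not a description of any vendor's scheme: the function names, the choice of input fields and the key handling are all assumptions made for the example.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not change the token."""
    return " ".join(value.strip().lower().split())


def make_token(first_name: str, last_name: str, dob: str, secret_key: bytes) -> str:
    """Return a keyed, one-way token for a patient (hypothetical scheme for illustration)."""
    message = "|".join(normalize(part) for part in (first_name, last_name, dob))
    # Without the secret key, the token cannot be recomputed from guessed names and dates.
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: in practice the key would be a managed secret stored apart from the data.
key = b"example-key-kept-outside-the-dataset"

token_a = make_token("Jane", "Doe", "1980-02-14", key)
token_b = make_token(" jane ", "DOE", "1980-02-14", key)
```

Here `token_a` and `token_b` are identical, so records for the same patient can be linked without storing the name or date of birth. The other fields left in the record can still point to an individual, which is why a token by itself does not settle the question of de-identification.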

A common misconception, even among those who are familiar with HIPAA, is that patient tokenization equals patient de-identification. While tokenization is a critical step, it is not enough under HIPAA for the information to be considered de-identified.

There are only two ways to render patient data “de-identified” under HIPAA: Safe Harbor and Expert Determination.

De-identify patient data with Safe Harbor

Under the Safe Harbor method, 18 specified types of identifiers must be removed from a patient dataset, and the organization must have no actual knowledge that the remaining information could be used to identify a patient. Safe Harbor is a highly prescriptive method that protects patient privacy. However, it also greatly reduces the utility of the remaining dataset for research purposes. For example, Safe Harbor requires reducing service dates, such as admission and discharge dates, to the year alone, so information that could help researchers understand disease progression is no longer available in the dataset.
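The trade-off is easier to see in a small sketch. The Python below applies a few Safe Harbor-style rules to a single record: it drops direct identifiers, reduces dates to the year, truncates the zip code to three digits and groups ages of 90 and over. It is a simplified illustration under stated assumptions, not a complete or compliant implementation. The field names are hypothetical, the identifier list is only a subset of the 18 categories, and the `low_population_zip3` set is a made-up stand-in for the three-digit zip areas that must be replaced with 000.

```python
from datetime import date

# Illustrative subset only; the Safe Harbor list covers 18 categories of identifiers.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "street_address", "phone", "email", "ssn", "medical_record_number",
}


def safe_harbor_sketch(record: dict, low_population_zip3: set) -> dict:
    """Apply a few Safe Harbor-style rules to one record (simplified, hypothetical)."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # direct identifiers are removed outright
        if isinstance(value, date):
            out[field + "_year"] = value.year  # only the year of a date is kept
        elif field == "zip_code":
            zip3 = str(value)[:3]
            out["zip3"] = "000" if zip3 in low_population_zip3 else zip3
        elif field == "age":
            out["age"] = "90+" if value >= 90 else value
        else:
            out[field] = value
    return out


record = {
    "name": "Jane Doe",
    "zip_code": "12345",
    "admission_date": date(2021, 3, 4),
    "discharge_date": date(2021, 3, 9),
    "age": 93,
    "diagnosis_code": "E11.9",
}

# Assumption: "123" is a placeholder value, not a statement about a real zip area.
result = safe_harbor_sketch(record, low_population_zip3={"123"})
```

In `result`, both the admission and discharge dates collapse to 2021, so the five-day length of stay can no longer be computed from the record.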

De-identify patient data with Expert Determination

The second option is conducting an Expert Determination, which mandates human expert review. In this scenario, a tokenized dataset derived from PII is analyzed by a human expert with deep domain expertise in statistics and data science. The expert performs a statistical analysis of the tokenized values to determine if the tokens pose a very small risk of re-identification or if further “remediation” or data removal is needed.

How do I choose between Safe Harbor and Expert Determination?

If Safe Harbor and Expert Determination can both ensure the compliant de-identification of patient data, which method is best?

Expert Determination is typically preferred by organizations looking to make their de-identified patient data available for further research and analysis. The primary reason is that it provides a great deal of flexibility in balancing the utility of patient data against the need to protect patient privacy.

For example, in rare diseases with just a few thousand diagnosed patients, the combination of diagnosis, zip code and a few other data elements may raise the risk that someone could determine a patient’s identity.

In such cases, the expert may recommend redaction, modification or removal of additional data elements in the existing tokenized dataset to reduce risk of re-identification. These recommendations are typically documented in a written expert determination report, otherwise known as a certification.
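One simple signal that can point to this kind of risk is how many patients share the same combination of indirect identifiers. The sketch below counts those combinations and flags any group smaller than a chosen size `k`. It is an illustration of the general idea only: HIPAA does not prescribe this check or any numeric threshold, an expert's analysis goes well beyond it, and the field names and the value of `k` are assumptions made for the example.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical, made-up records for illustration.
records = [
    {"diagnosis": "rare_x", "zip3": "941", "birth_year": 1980},
    {"diagnosis": "rare_x", "zip3": "941", "birth_year": 1980},
    {"diagnosis": "rare_x", "zip3": "100", "birth_year": 1975},
    {"diagnosis": "common_y", "zip3": "941", "birth_year": 1980},
    {"diagnosis": "common_y", "zip3": "941", "birth_year": 1980},
]

risky = small_groups(records, ["diagnosis", "zip3", "birth_year"], k=2)
```

With this sample data, `risky` contains one combination: the single rare-disease patient in zip area 100 born in 1975. That is the kind of record for which redacting or generalizing a field might be recommended.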

Organizations must then implement these recommendations as remediations to their existing tokenized patient data. Once this step is complete, the remediated dataset must receive a final review and “certification” by the human expert.


Making data shareable starts with ensuring patient data meets HIPAA de-identification requirements. Patient tokenization is the first step, followed by one of the two HIPAA de-identification methods: Safe Harbor or Expert Determination.

The choice depends on the balance between business needs, research objectives and, most importantly, patient privacy protection.

De-identification of patient data can be challenging, laborious and time-intensive, taking months to complete. But it doesn’t have to be, if you have the right expertise. This is the first in a three-part series exploring the three key things every healthcare privacy and compliance officer needs to know about de-identifying patient data. Next, we explore Expert Determination in more detail and how technology can complement and accelerate the work of human privacy experts to bring transparency and speed to a critical part of healthcare data exchange.

Editor’s note: This post was updated on October 19, 2022 for accuracy and comprehensiveness.
