Artificial Intelligence (AI) has experienced an increase in both interest and scrutiny, particularly with the rise of accessible large language models and generative AI tools such as ChatGPT and Midjourney. In response to increased AI usage in healthcare, governments are enacting laws regulating AI, particularly focused on patient privacy. While these laws may reduce some risks of AI, they may also introduce challenges for compliant implementation.
AI offers significant value to healthcare and life sciences companies by enhancing health data analysis and providing insights for medical research, patient care, and commercial use. To protect patient privacy while using AI tools, de-identification is a particularly important strategy and can exempt health information from regulations such as HIPAA.
However, AI models are complex and opaque, and effective de-identification can be difficult. A deep understanding of AI-related privacy risks, and strategies to mitigate them in the framework of HIPAA, is vital to enable AI use that is impactful, compliant, and privacy preserving.
In this blog post, we address:
- The opportunities for the use of AI in health data
- Strategies to protect patient privacy and maintain HIPAA de-identification when AI is used with health data
What is AI?
While the term “AI” is used to encompass a variety of technologies, we will focus primarily on “Machine Learning.” Machine Learning (ML) models “learn” information through a training process, where the ML algorithm analyzes patterns in the training dataset and stores these insights as a “trained model.” This training process may be supervised, with a human user providing input (e.g., “picture A has an owl in it; picture B has a robin in it”), or unsupervised, where the algorithm categorizes without human intervention. The trained model can then be used to make predictions, categorize information, or generate other outputs, based on the patterns it has recognized and stored.
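To make this concrete, here is a minimal sketch in Python using scikit-learn and a synthetic dataset (not real health data) that contrasts supervised training on labeled examples with unsupervised clustering; the dataset and model choices are illustrative assumptions rather than a recommendation:

```python
# Minimal sketch: supervised vs. unsupervised learning on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a training dataset: rows are records, columns are features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Supervised: the algorithm learns from human-provided labels (y_train).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: the algorithm groups similar records without any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_train)
print("Cluster sizes:", [int((clusters == c).sum()) for c in sorted(set(clusters))])
```

In both cases, what persists after training is the fitted model object itself, which stores the learned patterns and is later queried for predictions or groupings.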
The complex and dynamic nature of an ML model is its primary strength, but it is also a source of risk. While the training algorithm might be well understood, the trained model itself contains a large amount of information that is near-impossible to interpret directly. The performance and disclosure risks of an ML model are therefore difficult to predict, and must be estimated through empirical testing of the model.
This presents a challenge for compliant use of ML models. While some uses of ML with identified health information may be compliant with HIPAA, in many cases the data must be made exempt from the HIPAA Privacy and HIPAA Security Rules through de-identification. Health data from a HIPAA-covered entity can only be de-identified through two methods:
- Safe Harbor: Removal of a specific list of identifiers, and no “actual knowledge” that the information could be identifiable.
- Expert Determination: A documented assessment by an appropriate expert that the dataset has “very small” risk of re-identification.
Expert Determination is often preferred, as the requirements of Safe Harbor are inflexible and may not provide an optimal utility/privacy balance. However, the complexity and lack of transparency in ML models creates technical challenges for the expert’s assessment. Nevertheless, with appropriate care and expertise, these challenges can be addressed.
Using ML with de-identified health data
There are many situations where leveraging ML with de-identified health data can unlock better insights or new use-cases.
- Health data enrichment: ML models can analyze patient data and make clinical predictions. An example would be predicting the likelihood that a patient has a condition based on inference from diagnosis and procedure codes that do not directly indicate the condition. This can improve analysis, for example by facilitating better selection of patient cohorts for comparison studies.
- Next best action: ML models have also been used to improve direct care. In some cases, an ML model generates suggestions for diagnoses based on symptoms entered by a physician in combination with the patient’s medical history. In other cases, ML models are used to identify features in medical imagery. In all cases, these models are trained using large datasets of health information.
- Identifying target populations: ML models can leverage insights from a de-identified dataset to identify populations that may benefit from a particular drug or therapy. The first step is to leverage health data to derive general insights (e.g., “white men aged 50+ have an elevated risk of diabetes”). These insights can then be used to create groups that might be more likely to benefit from a given intervention, without ever directly linking any individual to health data. Note, any approaches that involve direct marketing to individuals require additional analysis under state consumer privacy laws.
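As a rough illustration of this “general insights first, individuals never” pattern, the sketch below uses Python and pandas with hypothetical column names and an arbitrary threshold; it derives cohort-level rates from a de-identified table and keeps only cohort definitions, never individual records:

```python
# Hypothetical sketch: derive cohort-level insights from a de-identified dataset
# without linking any individual to their health information.
import pandas as pd

# Assumed de-identified records: demographic attributes plus a diagnosis flag.
df = pd.DataFrame({
    "sex":      ["M", "M", "F", "M", "F", "M"],
    "age_band": ["50+", "50+", "40-49", "50+", "50+", "40-49"],
    "diabetes": [1, 1, 0, 1, 0, 0],
})

# Step 1: general insight -- diagnosis rate per demographic cohort.
rates = df.groupby(["sex", "age_band"])["diabetes"].mean().rename("diabetes_rate")
print(rates)

# Step 2: keep only cohort *definitions* whose rate exceeds a chosen threshold;
# these definitions (not individual records) drive downstream targeting.
target_cohorts = rates[rates > 0.5].index.tolist()
print("Cohorts of interest:", target_cohorts)
```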
Re-identification risks of ML in health data
The goal of using ML in a non-identifying way with health data is for an algorithm to learn general patterns and trends from a training dataset, and then apply these insights to another dataset or to other parts of the same dataset without linking any identifiable individual to their own health information. Unless used with care and expertise, the risks of disclosure and re-identification of health information can be significant.
A core risk in ML models is “memorization,” which results from “overfitting.” Instead of learning the general trends in the dataset, the model learns the specific traits of each individual. For example, instead of learning that individuals over 50 years old have a higher rate of diabetes, the model learns that there is a very high likelihood of a diabetes diagnosis in individuals born on Aug. 3, 1968, who own two cars, have signed up for one particular TV streaming service, are divorced, have three adult children, and shop at a particular grocery chain. This “trend” may describe only a single individual, and therefore carries a high risk of re-identification.
Overfitting is not unique to ML; it is simply a result of the large amount of information stored within a model. Even for simple non-ML models, the more parameters a model has, the more potential it has to be overly tuned to the data of specific individuals rather than to the general trends of the data. This increases the risk that the model’s predictions will reproduce the real data of individuals in the training dataset.
In this situation, privacy and utility can and should be aligned. Any model that is overfit will not only have an elevated risk of memorizing and revealing the underlying training data, but will also have poor accuracy when making predictions for individuals who are not within the training data.
Maintaining effective de-identification of health data in ML processes
There are a number of methods and technologies that can be applied to mitigate the risks of re-identification of health data when using ML. Generally, it is neither possible nor even advisable to reduce the risk of re-identification completely to zero. The HIPAA Expert Determination standard acknowledges this by only requiring the risk to be “very small.” (This principle is also acknowledged beyond HIPAA, such as in the UK ICO guidance on anonymisation under the UK GDPR.)
Approaches to de-identify data (and keep it de-identified) should therefore be based on estimates of risk and strategies to reasonably mitigate that risk. We advise that risk management for ML involve not only an assessment of technical and statistical details, but also a pragmatic assessment of the context and environment, the likelihood and mode of an attack, and the sensitivity of the data itself.
Methods to manage risk
- Leverage extensively de-identified data for model training: A simple approach to maintain de-identification is to ensure that the training data itself is so thoroughly de-identified that even if it is possible to use the ML model to disclose its training data, the disclosed information still remains de-identified. Standard statistical frameworks such as k-anonymity (see below) can be used to manage risk of re-identification. However, there may be data utility loss when k-anonymity (or similar) is used alongside large numbers of linkable variables. This approach can be effective and practical if the training dataset is simple, or does not contain potentially linkable information, but otherwise may not be the most suitable approach.
K-anonymity is a statistical framework that mitigates the risk of re-identification by requiring that every combination of identifying information in the dataset corresponds to a group of individuals (of size ‘k’ or above) and never to a single individual. For example, for k=5, if the only identifying information in the dataset is age, sex, race, and city of residence, it would not be possible to single out the information of a 50-year-old white male in Atlanta, as this would correspond to at least five individuals in the dataset.
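For illustration, a minimal k-anonymity check might look like the following sketch in Python/pandas; the file name, column names, and helper functions are hypothetical, and a real assessment would need to cover all potentially linkable variables:

```python
# Minimal sketch of a k-anonymity check: every combination of quasi-identifier
# values must correspond to at least k records. Names below are illustrative.
import pandas as pd

def smallest_group_size(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> bool:
    return smallest_group_size(df, quasi_identifiers) >= k

# Hypothetical usage with the quasi-identifiers from the example above.
df = pd.read_csv("deidentified_training_data.csv")  # hypothetical file
print("k=5 satisfied:", satisfies_k_anonymity(df, ["age", "sex", "race", "city"], k=5))
```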
- Work within a fully de-identified environment: If the model is not used to make predictions regarding identifiable individuals, then even if an individual’s health data is revealed, this health data could remain de-identified. Techniques such as k-anonymity may still be required to determine that the individuals have “very small” risk of re-identification, but considerably more data utility could be retained than when making predictions regarding potentially identifiable individuals. This approach may be particularly useful when using ML models to assist in analysis of de-identified datasets, such as the above example of using ML to impute predicted conditions to enrich a de-identified health dataset.
- Conduct model testing and calibration: It is particularly useful to run empirical tests of the trained model to determine whether it has a more than “very small” risk of re-identifying an individual. This typically works as follows:
- Some portion of the training dataset is reserved for testing (20% is a common standard).
- The model is tested on both the training and testing datasets, and the model performance for each dataset is measured with standard statistical metrics, such as precision, recall, and the “Area Under the Curve” (AUC).
- If the test and training datasets produce similar model performance results, this can be an indicator that the model has correctly learned the general traits, rather than memorizing the data on any individual.
- If the model performs considerably better on the training than the testing dataset, this could indicate it has been overfit to the training dataset. The model could have a significant risk of disclosing its training data. In this case, the model training process should be adjusted, and the process repeated.
It is important here to closely attend to technical nuances in the choice and interpretation of performance metrics, to prevent inaccurate estimates of disclosure risk.
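A minimal sketch of this testing step is shown below, using scikit-learn on synthetic data; the 20% hold-out, the choice of metrics, and the gap threshold used to flag possible overfitting are illustrative assumptions rather than prescribed standards:

```python
# Sketch: compare model performance on training data vs. a held-out test set.
# A large gap suggests the model may have memorized its training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def report(X, y, label):
    """Print precision, recall, and AUC for one dataset and return the AUC."""
    preds = model.predict(X)
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print(f"{label}: AUC={auc:.3f}, precision={precision_score(y, preds):.3f}, "
          f"recall={recall_score(y, preds):.3f}")
    return auc

auc_train = report(X_train, y_train, "train")
auc_test = report(X_test, y_test, "test")

# Illustrative rule of thumb: flag a large train/test gap for expert review,
# adjust the training process, and repeat.
if auc_train - auc_test > 0.05:
    print("Warning: possible overfitting/memorization -- retrain and retest.")
```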
- Enforce data separation: The ML model itself can be protected in a separate environment which controls the risk of linkage to other datasets (and potential disclosure as a result), such that only approved model outputs can be exported.
This is useful where a trained model involves tuning parameters (such as a “confidence threshold”), and controlling disclosure risk for all parameter configurations may be difficult or impossible. However, if the output of an ML model is a simple structured dataset, this can be de-identified with more traditional methods and exported with confidence. Data separation can be managed by:
- Security and contractual controls
- Privacy enhancing technologies such as encryption
- More complex technologies such as Federated Learning and Homomorphic Encryption
However, de-identification of the exported data becomes more complex as novel data types are introduced, such as complex natural language data, images, sounds, or video.
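As a simple illustration of controlling what leaves a separated environment, the hypothetical sketch below exports only a small structured table of binary flags produced at one fixed, approved confidence threshold, rather than raw scores or the model itself; the threshold value and function name are assumptions for the example:

```python
# Hypothetical export gate for a separated model environment: only an approved,
# structured output (binary flags at a fixed threshold) leaves the environment.
import pandas as pd

APPROVED_THRESHOLD = 0.8  # assumed value, fixed as part of the approved configuration

def export_predictions(record_keys: pd.Series, scores: pd.Series) -> pd.DataFrame:
    """Convert raw model scores into a simple structured dataset for export."""
    flags = (scores >= APPROVED_THRESHOLD).astype(int)
    # The exported table contains only record keys and a binary flag, which can
    # then be de-identified and risk-assessed with more traditional methods.
    return pd.DataFrame({"record_key": record_keys.values, "predicted_condition": flags.values})
```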
Managing privacy risks for ML models operates at the intersection of technical security, statistical risk assessment, and legal compliance. It is imperative (and, indeed, a legal requirement for the Expert Determination method) for healthcare organizations navigating privacy risk in ML models to work with data experts who are well versed in de-identification of health information and in the disclosure risks of ML models. There are potential risks in a simple application of methods such as k-anonymity, which can result in both poor data utility and insufficient privacy protection, particularly if all potentially identifying information is not fully accounted for. Partnership with an industry expert, like Datavant’s privacy solutions, is key to ensuring that de-identification is performed with the most appropriate statistical and scientific principles and methods, and with a sufficiently deep understanding of risk in the dataset and processes.
Responsible Innovation in Health Data AI
AI, in the form of machine learning, is a powerful tool for gaining insights from data and making predictions. De-identified health data can be used for machine learning, but this must be done in a privacy-forward manner. For the health data to continue to have “very small” risk of re-identification and remain de-identified under HIPAA, careful attention must be paid to technical details such as the security environment, model configuration, model outputs, and the training dataset. Trained and experienced experts are crucial for a full accounting of such details and effective management of risk.
Ready to maximize the utility of AI and machine learning in your data, while protecting patient privacy? Get started with Datavant privacy solutions.