By Jonah Leshin & Quinn Johns
Health Data Privacy, Utility, and Synthetic Data
While the idea of health data privacy has long been important, practical implementation has become increasingly essential with the de-siloing of digital health information. The Health Insurance Portability and Accountability Act (HIPAA) marked the beginning of modern privacy regulation, which continues to evolve as new forms of health data come online. Health data has tremendous research utility, but achieving high utility often requires making sacrifices on patient privacy.
Regulation moves the industry along this tradeoff curve towards privacy, reducing data utility. Over the past several years, advances in machine learning have given rise to new privacy-enhancing technologies for healthcare that have the capacity to shift this curve outwards instead of moving along it, enabling increased data utility without compromising privacy.
Synthetic data has the potential to be a critical technology in this space, as it enables representative patient data with inherent privacy protection. Moreover, best-practice synthetic data generation processes provide the user with privacy controls that are configurable and quantifiable. Core, decades-old risk assessment principles provide necessary context for evaluating synthetic data. The application of these principles to synthetic data forms a privacy framework that makes possible a range of previously unattainable use cases.
Synthetic Data Generation in Healthcare: How It Works
A synthetic data engine takes a real dataset as input and uses a machine learning model to generate an output dataset. The output dataset preserves the overall patterns and distributions of variables in the underlying dataset and thus serves as a realistic representation of the underlying real dataset. Notably, however, the output data is an entirely new dataset, distinct from the real data.
For example, if the input dataset has a date of service variable with the date of a patient’s most recent provider visit, the output dataset may contain dates not in the original dataset. At the same time, metrics such as the variance across all such dates in the original dataset will be approximated in the output dataset.
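To make the date of service example concrete, here is a minimal sketch in Python. It is not how a production engine works: as a stand-in for the machine learning model, it assumes a simple normal distribution fitted to the dates, and the input data and function names (`fit_and_sample_dates`, `variance_in_days`) are hypothetical. It only illustrates the two properties described above: the output contains dates that are not in the input, while the variance of the dates is approximately preserved.

```python
import random
import statistics
from datetime import date, timedelta

def fit_and_sample_dates(real_dates, n, seed=0):
    """Toy 'engine' (assumption: a normal fit stands in for the ML model)."""
    origin = min(real_dates)
    offsets = [(d - origin).days for d in real_dates]
    mu = statistics.mean(offsets)
    sigma = statistics.stdev(offsets)
    rng = random.Random(seed)
    # Sample brand-new day offsets from the fitted distribution.
    return [origin + timedelta(days=round(rng.gauss(mu, sigma))) for _ in range(n)]

def variance_in_days(dates):
    """Variance of a list of dates, measured in days squared."""
    origin = min(dates)
    return statistics.variance([(d - origin).days for d in dates])

# Hypothetical input: 500 most-recent-visit dates spread over one year.
data_rng = random.Random(42)
real = [date(2023, 1, 1) + timedelta(days=data_rng.randint(0, 364)) for _ in range(500)]

synthetic = fit_and_sample_dates(real, n=500, seed=1)

real_set = set(real)
new_dates = [d for d in synthetic if d not in real_set]
variance_ratio = variance_in_days(synthetic) / variance_in_days(real)

print(f"synthetic dates not present in the real data: {len(new_dates)}")
print(f"synthetic variance / real variance: {variance_ratio:.2f}")
```

A ratio close to 1 indicates that the spread of dates is approximated in the output, even though individual dates differ from the input.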
Advanced synthetic data engines can be configured to shape the output with regard to both utility and privacy. With regard to utility, a user may select a data generation model that is well suited to capture certain traits. For privacy, a user may set limits on metrics that quantify dimensions of risk in the output dataset. This flexibility and control allow synthetic data to be optimized for both strict and permissive privacy environments.
Why Synthetic Data Matters
The core value of synthetic data is that it enables organizations to access concrete and representative insights from sensitive data, while minimizing the risk to patient privacy and limiting governance requirements.
In particular, organizations can access and share synthetic data in circumstances where sharing identified or de-identified data is too difficult or even impossible.
Consider, for example, a pharma company with a commercial treatment for cancer in the United States. The company wants to understand the course of treatment and outcomes for patients with the same cancer in the EU, so it can prepare to enter that market. Under GDPR, this patient-level data cannot be shared with the US company unless the organization assumes significant additional safeguards, contractual terms, and governance processes. If a synthetic version of the EU patient data is created instead, the US company can access that synthetic data without requiring the same level of overhead, saving time and money.
Synthetic data can also play a role in accelerating insights prior to the exchange of de-identified data. One example is a lab testing company that is seeking to license its de-identified lab records to an analytics company to create comprehensive patient journeys. The lab company first creates a limited synthetic dataset and provides it to the analytics company “presale.” The synthetic dataset has the same fields, formats, and statistical characteristics as the original data, which gives the analytics company an opportunity to evaluate whether the lab data adds new information to its current knowledge of the patients’ journeys. The same data format also enables the analytics company to get a head start on setting up the tools and environment that will be needed to work with the data. As a result, both companies are able to save resources, by ensuring they only invest in partnerships that add value and by moving faster towards successful completion of their data projects.
These are just two examples of how synthetic data can be used to shift the privacy-utility curve outward and enable broad access to high-utility data without making tradeoffs on patient privacy.
Synthetic data enables many other use cases, from accelerating development and testing of machine-learning models to augmenting rare or limited patient cohorts. As the technology to generate and apply synthetic data improves, more use cases continue to evolve.
Risk Assessment Principles: Overview
Whenever patient data plays a role in the generation of a dataset – whether that be via a synthetic engine, encryption, redaction, or other means – it is important to assess the privacy risk associated with the dataset. Specifically, an expert should evaluate the risk of a patient being identified from a dataset, taking into account what information could be learned or inferred about the patient if they were to be identified. The HIPAA Privacy Rule speaks to these risks, and can serve as guidance for assessing risk in health data that lies outside the scope of HIPAA, such as synthetic data.
Risk Assessment Principles Applied to Synthetic Data
When synthetic data generation is done properly, a record in the output synthetic dataset is not derived from any single record in the input dataset. Therefore, there is an inherent layer of privacy baked into synthetic data. Nonetheless, one must understand the nature of the privacy risks that may exist and how they can be measured. As a high-level benchmark, any well-designed engine should produce an output that would be considered very low risk in the context of HIPAA de-identification.
Privacy risk assessment of synthetic data focuses on the extent to which a synthetic dataset is representative of the input real data and the implications therein.
There are two metrics commonly used to quantify this risk:
- Attribute Inference.
The idea of attribute inference is that if we were able to associate a synthetic record with a real individual (perhaps an attacker knows some information about the individual a priori), we could infer new information about that individual based on the information in the synthetic dataset.
For example, suppose we have the following record in our synthetic dataset:
The fields shaded in green are quasi-identifiers that could be used by an attacker to disclose a patient’s identity. The field shaded in orange represents sensitive information.
If these quasi-identifiers match the quasi-identifiers from the underlying real dataset, there is risk that an attacker could both identify the patient and then infer sensitive information about the patient in question.
For further details about attribute inference, see El Emam et al. (2020) and Matwin et al. (2015).
- Membership Inference.
Membership inference concerns an attacker’s ability to use the synthetic dataset to determine that a known patient record is included in the underlying real training dataset.
For example, suppose a synthetic dataset models the underlying real dataset with excessive fidelity. In this case, an attacker may deduce that if a known record is sufficiently similar to a record in the synthetic dataset, then it was in the underlying dataset.
Knowledge of whether a known record is contained in a dataset can itself be sensitive information. For example, a dataset may be known to consist of patients from a cancer registry.
For further details about membership inference, see El Emam et al. (2022) and Shokri et al. (2017).
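The excessive-fidelity scenario can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two-field records, the "overfit" engine that emits near-copies of its training rows, the distance threshold, and the function names. The attacker claims that a known record was in the training data whenever it lies within the threshold of some synthetic record, and the sketch compares how often that claim is made for true members versus non-members drawn from the same population.

```python
import random

def nearest_distance(record, dataset):
    """Euclidean distance from a record to its closest row in the dataset."""
    return min(
        sum((a - b) ** 2 for a, b in zip(record, row)) ** 0.5
        for row in dataset
    )

def claimed_member_rate(targets, synthetic, threshold):
    """Fraction of target records the attacker would claim were in the training data."""
    hits = sum(nearest_distance(t, synthetic) <= threshold for t in targets)
    return hits / len(targets)

rng = random.Random(0)

def draw_record():
    """Hypothetical patient record: (age in years, lab value)."""
    return (rng.uniform(20, 90), rng.uniform(50, 200))

members = [draw_record() for _ in range(100)]      # used to train the engine
non_members = [draw_record() for _ in range(100)]  # same population, not used

# Assumed excessive-fidelity engine: near-copies of the training rows.
overfit_synthetic = [
    (age + rng.gauss(0, 0.1), lab + rng.gauss(0, 0.1)) for age, lab in members
]

threshold = 1.0  # hypothetical attacker threshold
member_rate = claimed_member_rate(members, overfit_synthetic, threshold)
non_member_rate = claimed_member_rate(non_members, overfit_synthetic, threshold)
advantage = member_rate - non_member_rate

print(f"claimed as members: {member_rate:.2f} of true members, "
      f"{non_member_rate:.2f} of non-members (gap {advantage:.2f})")
```

A large gap between the two rates means the synthetic data leaks membership. An engine that does not track individual training rows so closely would be expected to bring the two rates nearer to each other.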
The attribute inference and membership inference metrics can be quantified with precise numerical values. They should be viewed as inputs towards a holistic privacy assessment of synthetic data.
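As one illustration of how such a metric can be given a numerical value, the attribute inference idea can be sketched in Python as follows. This is a simplified, hypothetical measure rather than a standard published metric: the field names, the toy records, and the exact-match rule on quasi-identifiers are all assumptions. For each real record, the attacker looks up synthetic records with the same quasi-identifiers and guesses the most common sensitive value among them. The score is the fraction of real records for which that guess is correct.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age", "zip3", "sex")  # hypothetical field names
SENSITIVE = "diagnosis"                     # hypothetical sensitive field

def infer_sensitive(target, synthetic):
    """Attacker's guess for the target's sensitive value, or None if no match."""
    key = tuple(target[q] for q in QUASI_IDENTIFIERS)
    matches = [
        s[SENSITIVE] for s in synthetic
        if tuple(s[q] for q in QUASI_IDENTIFIERS) == key
    ]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]

def attribute_inference_rate(real, synthetic):
    """Fraction of real records whose sensitive value is guessed correctly."""
    correct = sum(infer_sensitive(r, synthetic) == r[SENSITIVE] for r in real)
    return correct / len(real)

# Toy data, invented for this sketch.
real = [
    {"age": 34, "zip3": "021", "sex": "F", "diagnosis": "asthma"},
    {"age": 61, "zip3": "945", "sex": "M", "diagnosis": "diabetes"},
    {"age": 47, "zip3": "100", "sex": "F", "diagnosis": "melanoma"},
    {"age": 29, "zip3": "606", "sex": "M", "diagnosis": "asthma"},
]
synthetic = [
    # Quasi-identifiers and sensitive value both match a real row.
    {"age": 34, "zip3": "021", "sex": "F", "diagnosis": "asthma"},
    # Quasi-identifiers match a real row, sensitive value differs.
    {"age": 61, "zip3": "945", "sex": "M", "diagnosis": "asthma"},
    # Quasi-identifiers match no real row.
    {"age": 52, "zip3": "100", "sex": "F", "diagnosis": "melanoma"},
    {"age": 30, "zip3": "606", "sex": "M", "diagnosis": "asthma"},
]

rate = attribute_inference_rate(real, synthetic)
print(f"attribute inference rate: {rate:.2f}")  # 1 of 4 real records -> 0.25
```

In a holistic assessment, a rate like this would typically be compared against a baseline, such as how often the sensitive value could be guessed without access to the synthetic data at all.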
We next discuss how concrete privacy considerations underlie the evaluation of synthetic data privacy, including these metrics.
Concrete healthcare privacy considerations: One level deeper
There are specific factors that experts consider when assessing privacy risk of any dataset. In the table below, we outline several and illustrate examples of their application in the evaluation of synthetic data risk.
Synthetic Data-Specific Considerations
While synthetic data risk assessment is guided by the core principles outlined in Table 2, there are certain considerations that are specific to the synthetic data generation process. For example, the synthetic data engine’s algorithm should not be knowable, so that information about the underlying real dataset cannot be reverse engineered from the output.
In a related fashion, if a user repeatedly generates synthetic data outputs from the same underlying real dataset, it is possible to triangulate the results to infer information about the real dataset (such as min or max values of a given variable). Therefore, repeated generation should be limited.
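The triangulation risk can be illustrated with a deliberately simple Python sketch. The engine here is an assumption chosen to make the effect visible: it samples values uniformly within the observed range of the real variable, which is not how a real engine models data. The ages and run counts are likewise hypothetical. The point is only that the maximum pooled over many outputs lands much closer to the real maximum than the maximum of any single output.

```python
import random

def generate(real_values, n, rng):
    """Toy engine (assumption): sample uniformly within the observed real range."""
    lo, hi = min(real_values), max(real_values)
    return [rng.uniform(lo, hi) for _ in range(n)]

rng = random.Random(7)
real_ages = [rng.randint(18, 95) for _ in range(200)]  # hypothetical real variable
true_max = max(real_ages)

# The same user requests 200 separate synthetic outputs of 50 rows each.
runs = [generate(real_ages, 50, rng) for _ in range(200)]

single_run_max = max(runs[0])
pooled_max = max(max(run) for run in runs)

single_gap = true_max - single_run_max  # what one output reveals
pooled_gap = true_max - pooled_max      # what 200 pooled outputs reveal

print(f"gap to real max after 1 run:    {single_gap:.3f}")
print(f"gap to real max after 200 runs: {pooled_gap:.3f}")
```

Capping the number of generations per input dataset, as recommended above, bounds how far an attacker can narrow this gap.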
Synthetic Data in Healthcare: A Privacy-First Approach
When building technology to enable utility from patient data, it is best practice to incorporate privacy in the design phase rather than as a consideration after the fact. Synthetic data generation is a valuable illustration of this approach, as there is explicit dissociation between a real patient dataset and the synthetically generated dataset.
The strength of the privacy protection afforded by synthetic data can be understood in the context of core data privacy principles. With this understanding, we are able to pull forward certain health data use cases by years, accelerating the pace of research.
 El Emam, K., Mosquera, L., & Hoptroff, R. (2020). Practical Synthetic Data Generation.
 Matwin, S., Nin, J., Sehatkat, M., & Szapiro, T. (2015). A Review of Attribute Disclosure Control. Studies in Computational Intelligence, 567.
 El Emam, K., Mosquera, L., & Fang, X. (2022). Validating a membership disclosure metric for synthetic health data. JAMIA Open, 5(4), ooac083. https://doi.org/10.1093/jamiaopen/ooac083
 Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership inference attacks against machine learning models. 2017 IEEE Symposium on Security and Privacy (SP).