Advances in healthcare have enabled the development of increasingly targeted and effective treatments. Data plays a crucial role in this innovation. One example is data linkage between:
- Information collected from clinical trial participants at the start of a trial, and
- Information that captures their subsequent encounters with the healthcare system.
Just as there are intricacies in targeted therapies, rare disease registries, and niche healthcare advertising campaigns, there are also data privacy considerations for the associated small populations. It is critical to understand this context in order to effectively utilize the data without compromising patient privacy.
Minimizing the risk of re-identification
Privacy is paramount when linking data. At the same time, for the data to retain utility, it must be shared in a form that preserves meaningful value from its original content.
According to the Health Insurance Portability and Accountability Act (HIPAA), identified patient data can be shared under a Business Associate Agreement (BAA). This is a legal agreement that ensures the proper use of protected health information (PHI). Due to the sensitivity of PHI, however, sharing the data in de-identified form is often preferred. According to HIPAA, data is considered de-identified if a statistical expert has determined the overall risk of re-identification to be “very small,” and HIPAA permits data to be shared in this form.
In this context, the risk of re-identification of a patient in a dataset refers to the ability to use patient information in the dataset, in combination with reasonably available information (e.g., other publicly available datasets), to identify the patient. For example, knowledge of a patient’s date of birth and zip code will likely narrow down the possible identities for the patient to a degree that puts the patient’s anonymity at risk. In practice, the “very small” risk criterion often manifests as a requirement that at most some set percentage of patients in the dataset may be at elevated risk of re-identification.
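One rough way to make this percentage concrete is to count how many records share each combination of potentially identifying fields. The sketch below is illustrative only and is not necessarily how a HIPAA expert would measure risk; the field names (`birth_year`, `zip3`), the sample records, and the group-size cutoff of 2 are all assumptions made for the example.

```python
from collections import Counter

def share_at_elevated_risk(records, fields, min_group_size):
    """Fraction of records whose combination of `fields` is shared by
    fewer than `min_group_size` records in the dataset."""
    if not records:
        return 0.0
    keys = [tuple(record[f] for f in fields) for record in records]
    group_sizes = Counter(keys)
    at_risk = sum(1 for key in keys if group_sizes[key] < min_group_size)
    return at_risk / len(records)

# Hypothetical records; all values are made up for illustration.
records = [
    {"birth_year": 1950, "zip3": "021"},
    {"birth_year": 1950, "zip3": "021"},
    {"birth_year": 1985, "zip3": "100"},
    {"birth_year": 1985, "zip3": "100"},
    {"birth_year": 1941, "zip3": "059"},  # the only record with this combination
]

risk_share = share_at_elevated_risk(records, ["birth_year", "zip3"], min_group_size=2)
print(f"{risk_share:.1%} of records fall in a group smaller than 2")
```

Here one record in five has a unique combination of values, so the computed share would be compared against whatever percentage threshold has been set for the dataset.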
Statistical volatility in smaller datasets
In the case of small patient populations, extra care must be taken to ensure this re-identification risk is very small. While there is no strict size cutoff for a dataset to be labeled “small,” privacy experts generally consider datasets with tens of thousands of patients or fewer to be small.
To be precise about the meaning of a dataset, we define it to be a set of patient records, where each record consists of information – medical, demographic, etc. – associated with the patient. Notably, however, datasets are not necessarily static; for example, a dataset might consist of COVID-19 test results from a particular testing location, which may be updated over time as more tests are taken.
Extra attention is required because small datasets are more statistically volatile. In particular:
- Small datasets often target small cohorts, which may inherently present higher levels of risk.
- The risk profile of a small dataset is more volatile. That is, it is more likely to have its distribution of features materially affected when adding a relatively small number of new patients; thus, the dataset is less likely to remain representative of its original population over time.
We now unpack these two assertions. Consider an analogy of an urn with 992 green balls and 8 red balls. The balls represent the population at large, with the red balls representing individuals with more outlier features (such as unusual height or membership in a smaller ethnic group) that would make an individual more easily identifiable, and the green balls representing individuals with fewer outlier features.
By their nature, targeted studies focus on a specific segment of the population. These segments are often defined by a set of requirements across multiple attributes – for example, a requirement on age plus a requirement on disease status.
One of these requirements may restrict the dataset to a subpopulation of higher re-identification risk; for example, including year of birth information increases the risk of re-identification for older individuals since there are fewer individuals in their 70s than in their 30s. Therefore, a study requirement that includes a minimum age cutoff of 70 years would increase the risk of the corresponding dataset.
In terms of the urn analogy, this age requirement amounts to working with a sample of red balls. We can contrast this with a dataset that contains a representative subset of the population, which is analogous to drawing a sample at random from the urn, for which the expected proportion of red balls is 0.8%.
For the second assertion – that the risk profile of a smaller dataset is more volatile – we can continue our analogy and compare a large urn with 995 green balls and 5 red balls with a small urn containing 197 green balls and 1 red ball. Both urns have a similar proportion of red balls, approximately 0.5%.
We assume that the set percentage threshold for a dataset to present “very small” risk of re-identification is 1%. In this case, the large urn can have 5 new balls of any kind added without the red ball proportion exceeding 1%, whereas if only a single red ball were added to the small urn, the red ball proportion would exceed 1%.
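The urn arithmetic can be checked with a few lines of Python. This is a minimal sketch of the analogy only; the function names are hypothetical, and the 1% threshold is the one assumed in the text. Exact fractions are used to avoid rounding near the threshold.

```python
from fractions import Fraction

THRESHOLD = Fraction(1, 100)  # the 1% threshold assumed in the text

def red_share(green, red):
    """Proportion of red balls in the urn."""
    return Fraction(red, green + red)

def max_red_additions(green, red, threshold=THRESHOLD):
    """Largest number of red balls that can be added while the red
    proportion stays at or below `threshold`."""
    added = 0
    while red_share(green, red + added + 1) <= threshold:
        added += 1
    return added

large = max_red_additions(green=995, red=5)  # 5: (5 + 5) / 1005 is just under 1%
small = max_red_additions(green=197, red=1)  # 0: (1 + 1) / 199 is just over 1%

print("Large urn tolerates", large, "additional red balls")
print("Small urn tolerates", small, "additional red balls")
```

Adding red balls is the worst case; adding green balls only lowers the red proportion, which is why the large urn tolerates 5 new balls of any kind.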
The takeaway is that adding or modifying even a small number of patients in a small dataset could significantly change its risk profile. Such a change could, in theory, also decrease the level of risk, but the point is that smaller datasets are volatile in a way that requires closer monitoring, whereas larger datasets are more robust to small changes.
The role of privacy experts
Privacy experts play a critical role in helping to mitigate the risk of small datasets. While a HIPAA privacy expert ultimately determines whether the risk of a dataset is sufficiently small, they also play a role in assessing the privacy impact of various data transformations that one might wish to make to ensure privacy is protected.
The path to an Expert Determination then becomes a collaborative effort, with a privacy expert providing guardrails and guidance on risk, and the data source providing perspective on the relative utility of various data elements. We now work through an example use case to illustrate how this might play out in practice.
Example: Lung cancer therapy clinical trial
Consider a dataset of 1,000 lung cancer patients recruited from the northeastern US who have been treated with a third-line clinical trial therapy addressing specific mutations. Suppose that we wish to link this data with a large health insurance claims dataset in order to analyze the procedures that the clinical trial participants have undergone before and after the trial.
As is the case with any Expert Determination process, at the outset it is important to conduct an internal evaluation of one’s dataset to understand the relative utility of fields of interest and the underlying data types. In this example, suppose the study focuses on procedure types, dates, and costs, but that the patient’s residential location is not relevant.
With this information in hand, the data owner may choose to redact all information related to the patient’s home geography (note that this information may come from either the small dataset, or the larger claims dataset, since they are linked). Since such location information is publicly available, redacting it reduces the risk of re-identification.
Due to the nature of the therapy being studied, the genetics component of the dataset may also be important. Privacy considerations around genomics data continue to evolve as genetic testing becomes more prevalent and the increase of publicly available genetics databases creates new privacy dynamics. The extent to which certain genetic information may be permissible should be discussed with a privacy expert.
Suppose that approximately eight new participants are enrolled in the trial and added to the dataset each month. Using the aforementioned criterion that fewer than 1% of patients may be at elevated risk, adding exactly 8 new patients to the dataset of 1,000 patients is unlikely to compromise the dataset’s risk profile: even if all 8 were at elevated risk, they would account for 8 of 1,008 patients, or about 0.8%, so the threshold would be crossed only if more than two existing patients were already at elevated risk. However, given the potential variance in the number of patients to be added, it is advisable to revisit the Expert Determination analysis on a regular (e.g., monthly) basis or ad hoc as new patients are added to the dataset.
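A back-of-the-envelope check of this enrollment scenario is sketched below. It takes the worst case in which every newly added patient is at elevated risk. The counts of existing at-risk patients (0, 2, and 3) are hypothetical, since the example does not state that number, and the function name is illustrative.

```python
from fractions import Fraction

THRESHOLD = Fraction(1, 100)  # fewer than 1% of patients may be at elevated risk

def worst_case_share(total, at_risk, added):
    """Share of at-risk patients if every newly added patient
    turned out to be at elevated risk."""
    return Fraction(at_risk + added, total + added)

# Hypothetical numbers of existing at-risk patients among the 1,000.
results = {}
for existing_at_risk in (0, 2, 3):
    share = worst_case_share(total=1000, at_risk=existing_at_risk, added=8)
    results[existing_at_risk] = share < THRESHOLD
    status = "below" if share < THRESHOLD else "at or above"
    print(f"{existing_at_risk} existing at risk: {float(share):.2%} ({status} 1%)")
```

Under these assumptions the worst case stays below 1% with zero or two existing at-risk patients but not with three, which illustrates why the analysis is worth revisiting as enrollment continues.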
The privacy impact of a dataset update depends on the characteristics of the population that is being added. If the new population is representative of the existing dataset population, then we would not expect changes to privacy risk. On the other hand, if recently recruited clinical trial patients were, for example, of a less common ethnicity, which was not previously represented in the dataset, further data operations may need to be applied in order to meet HIPAA compliance.
A privacy expert who evaluates the risk of a dataset is equipped to explicitly enumerate these operations, known as remediations, that are necessary to reduce risk. Remediations may take a variety of forms including redaction or truncation of certain fields (e.g., truncating a zip code to its first two digits), removing specific values in a field (e.g., dates associated with a procedure that implies a birth), or the removal of a risky subset of patients entirely.
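The three kinds of remediation mentioned above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the record layout, the field names (`patient_id`, `zip`, `ethnicity`), and the helper `apply_remediations` are all hypothetical, and which remediations to apply would be determined with a privacy expert.

```python
def apply_remediations(records, redact_fields=(), zip_digits=None, drop_ids=()):
    """Return remediated copies of `records`.

    redact_fields: field names to remove from every record
    zip_digits:    if set, keep only this many leading zip code digits
    drop_ids:      patient_id values whose records are removed entirely
    """
    remediated = []
    for record in records:
        if record["patient_id"] in drop_ids:
            continue  # removal of a risky subset of patients
        cleaned = {k: v for k, v in record.items() if k not in redact_fields}
        if zip_digits is not None and "zip" in cleaned:
            cleaned["zip"] = cleaned["zip"][:zip_digits]  # truncation
        remediated.append(cleaned)
    return remediated

# Hypothetical records; all values are made up for illustration.
records = [
    {"patient_id": "p1", "zip": "02139", "ethnicity": "A"},
    {"patient_id": "p2", "zip": "10027", "ethnicity": "B"},
    {"patient_id": "p3", "zip": "05401", "ethnicity": "C"},
]

result = apply_remediations(
    records, redact_fields=("ethnicity",), zip_digits=2, drop_ids={"p3"}
)
print(result)
```

The function builds new dictionaries rather than editing in place, so the original records remain available for comparing utility before and after remediation.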
Remediations are common in the Expert Determination process, as privacy experts account for risk not only within individual data fields, but also risk that can arise from the interaction of data fields. For example, for certain records, the combination of zip code and ethnicity may present greater risk than the risk of either of these fields in isolation. There may be discretion in choosing a satisfactory combination of remediations; thus, a qualitative understanding of the use case should inform the decision of which data operations to apply.
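To illustrate how fields can interact, the sketch below compares the smallest group of records sharing a value for each field alone with the smallest group sharing a combination of both. The records and field names are hypothetical, and group size is used here only as a simple stand-in for risk.

```python
from collections import Counter

def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values
    for all of the given `fields`."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return min(counts.values())

# Hypothetical records; all values are made up for illustration.
records = [
    {"zip3": "021", "ethnicity": "A"},
    {"zip3": "021", "ethnicity": "A"},
    {"zip3": "021", "ethnicity": "B"},
    {"zip3": "100", "ethnicity": "B"},
    {"zip3": "100", "ethnicity": "B"},
    {"zip3": "100", "ethnicity": "A"},
]

by_zip = smallest_group(records, ["zip3"])                # 3
by_ethnicity = smallest_group(records, ["ethnicity"])     # 3
by_both = smallest_group(records, ["zip3", "ethnicity"])  # 1

print(by_zip, by_ethnicity, by_both)
```

Each field on its own places every record in a group of three, yet the combination singles out two records, which is the kind of interaction that can call for a remediation on one of the fields.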
With proper diligence, small datasets can unlock novel use cases
While smaller datasets require care and an understanding of the privacy principles at play, we must be able to share information collected from smaller population groups in order to inform research in genetic sequencing, drug development, and medicine more broadly. We have the necessary privacy mechanisms to work with this data, and the opportunities merit strong pursuit.
Editor’s note: This post was updated in December 2022 for accuracy and comprehensiveness.