Privacy Frontiers in Health Data: Genomics (Part 2)

Publish Date September 20, 2022

In Part 1 of this article, we discussed the evolving scientific landscape of disclosure risk in genetics data. In Part 2 we will explore the legislative outcomes of this development. In particular, we will analyze the requirements of the HIPAA Omnibus Rule (2013) and propose a framework for how privacy experts should apply these in practice. Later in this article, we will evaluate how certain research questions could prove key to further advancing the sophistication and robustness of that approach. Part 2 contains more technical detail than Part 1 and presumes a general familiarity with privacy analysis for health data.

What is the current legislative stance?

It was an auspicious coincidence that the HIPAA Omnibus Rule was introduced just one day before the anticipated and impactful publication of the Erlich study (Gymrek et al (2013)). The Rule's inclusion of genetic information is the first and only amendment to HIPAA to address this type of health data, and it therefore remains the legal standard against which de-identification of genetic data must be judged.

HIPAA permits a dataset to be de-identified through one of two approaches:

1. Application of Safe Harbor provisions, which involves the removal or prescribed modification of 18 identifiers (name, address, date of birth etc.).
2. An Expert Determination to be performed by an expert employing rigorous statistical and scientific principles to assess whether the risk of re-identification for a dataset can be considered ‘very small’ and to determine bespoke modifications to the data required to meet this threshold.

In this context, a few key features of the HIPAA Omnibus Rule pertaining to genetic information warrant consideration:

Uniquely identifying genetic information is considered to be PHI under HIPAA.
Genetic information is not included as a Safe Harbor identifier.
Genetic data is not precisely defined.
Genetic data for family members of an individual in the dataset is also subject to the Rule. Further, fetuses and embryos of individuals and family members are similarly covered.

Two broad conclusions can be drawn from these points:

1. Genetic data can only be assessed by Expert Determination, not Safe Harbor.
2. The default assessment of genetic data is binary in nature: it must be removed as presumed PHI unless an expert determines that the risk is negligible in a particular case.

The latter stance, in particular, is the inverse of the traditional logic underpinning health data assessments in Expert Determinations; typically, the expert operates from the presumption of permitting data unless they judge that it could carry significant risk and accordingly requires removal or modification.

Whether assessments of genetic data must be truly binary, with data either fully excluded or entirely permitted, is something of an open question. We will see that the limited availability of robust reference data on the prevalence of particular genetic information across the US population renders the quantification of genetic disclosure risk a steep statistical challenge.

But before tackling that issue, it is worthwhile to explore how the ‘default approach’ framework may play out in practice with the common categories of genetic data appearing in the commercial and clinical space.

Genetic data risk assessment in practice

Examples of the default approach in practice should begin with an appreciation that different types of genetic data fall in different places along the spectrum of direct-indirect identifiers that can be used to characterize all information in a risk context. Direct identifiers are associated with single individuals or very small groups by their nature — we would think of names, social security numbers, and addresses this way. Indirect identifiers tend not to point to an individual in isolation, but in combination can significantly elevate risk. Consider how knowing the age, gender, race, state, and ZIP code of an individual decreases the size of the population group to which they may belong.
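
To make the narrowing effect of indirect identifiers concrete, here is a minimal Python sketch. The records, the field names (age_band, gender, state) and the helper functions are all invented for illustration and do not come from any real dataset; the counting itself is the familiar equivalence-class tally used in k-anonymity-style analyses, offered here as one plausible way to picture the effect rather than as a prescribed method.

```python
from collections import Counter

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"age_band": "30-39", "gender": "F", "state": "OH"},
    {"age_band": "30-39", "gender": "F", "state": "OH"},
    {"age_band": "30-39", "gender": "M", "state": "OH"},
    {"age_band": "70-79", "gender": "M", "state": "TX"},
    {"age_band": "70-79", "gender": "F", "state": "TX"},
    {"age_band": "70-79", "gender": "M", "state": "OH"},
]

def group_sizes(rows, fields):
    """Count how many rows share each combination of values for `fields`."""
    return Counter(tuple(row[f] for f in fields) for row in rows)

def smallest_group(rows, fields):
    """Size of the smallest group formed by `fields` (the k in k-anonymity)."""
    return min(group_sizes(rows, fields).values())

print(smallest_group(records, ["gender"]))           # 3
print(smallest_group(records, ["gender", "state"]))  # 1
```

On this toy data, gender alone leaves every individual in a group of at least three, while gender and state together already isolate single individuals.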

There is an imperfect relationship between this spectrum and our definition of direct/indirect genetic information: direct sequencing genetic data is more likely to tend towards a direct identifier as the length and uniqueness of the sequence increases, while indirect genetic data contributes to the risk of inferring an individual genetic profile when combined with other pertinent information.

The progression of sequence data towards direct identifier status would conclude with the extreme scenario of a fully sequenced human genome. As this would be uniquely identifying, it should of course be impermissible under HIPAA if it could be related in another data source to the name of the individual who provided it. There is no immediate prospect of full genomes becoming widespread in health data, but short and long sequences are not uncommon and increasingly prominent in health records.

Variants in sequences

Longer sequences tend to be more identifying in part because they are more likely to feature variations like rare single nucleotide polymorphisms (SNPs) or highly specific combinations of more frequently occurring SNPs. These variations describe the alteration of nucleotides (basic units of DNA) at specific locations in the genome: for example, a cytosine (C) nucleotide being replaced by a thymine (T) nucleotide, thereby changing the sequence CGATACT to CGATATT (see Johnson (2009)). It is also important to appreciate that some regions of the genome are more prone to containing pathogenic mutations than others. For example, regions like exons that encode proteins generally have a lower frequency of variants than non-coding regions (Smits et al (2004)).
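
The substitution in the example above can be located programmatically. The following is a small Python sketch; the function name substitution_sites is our own invention, and it assumes two already-aligned sequences of equal length that differ only by single-base substitutions (no insertions or deletions).

```python
def substitution_sites(reference, observed):
    """Return (position, ref_base, alt_base) for each mismatch; positions are 1-based.

    Assumes the two sequences are already aligned and of equal length.
    """
    if len(reference) != len(observed):
        raise ValueError("sequences must have equal length for a substitution-only comparison")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, observed), start=1)
        if ref_base != alt_base
    ]

print(substitution_sites("CGATACT", "CGATATT"))  # [(6, 'C', 'T')]
```

The longer the sequence, the more positions there are at which such a difference could occur, which is one way to see why longer sequences are more likely to carry a distinguishing combination of variants.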

As we noted in Part 1, these genetic mutations are vast in number, with many thousands often observed for a single gene. While meticulously cataloged, variant databases typically do not contain comprehensive information on the observed rate of SNPs in the US population or demographic and geographic sub-groups. Therefore, establishing which SNPs — whether they appear within a sequence or as a heritable mutation explicitly detected and recorded — could contribute non-negligible risk and at what level is not currently feasible.

This unfortunately necessitates that the default approach be followed in lieu of a more sophisticated, targeted mitigation of quantifiably more risky sequences. Indeed, in making these judgments, a privacy expert would likely have to exercise some subjective, contextual reasoning for the sequence length threshold at which permissibility could reasonably be granted. Similarly, there is no established quantification metric for assessing the combined risk of multiple SNPs within a sequence. These examples should serve to highlight some of the real, practical challenges facing this area of risk assessment!

Conservative assessment with respect to reference data

It is worth pausing here to reflect on a subtlety of risk assessment peculiar to genetic health data. Perhaps contrary to one’s intuition, rare diseases should not intrinsically increase the risk of a dataset because there is a lack of reasonably available datasets on the breakdown of rare diseases across demographic and geographic categories that could be leveraged to relate such information to individuals.

However, it behooves us to not treat rare genetic variants according to this standard.

Due to the ever-increasing frequency of genetic sequencing and the dynamic nature of publicly available genetic data, the appropriately conservative stance should be to assume that an unknowable reservoir of genetic sequencing data is already available — or could soon become so — to an anticipated recipient of a health dataset.

Why oncology data is often low risk

We have painted a rather stark picture in our discussion of the challenges surrounding quantifying the risk of genetic mutations. We should also not forget that some sequences are themselves liable to uniquely identify individuals — the Y-chromosome short tandem repeats of the Erlich study springing out as an emphatic case.

However, it is important to shine a bright light on the fact that the predominant form of sequencing data concerns tumor suppressor genes or proto-oncogenes. Such research forms the spine of oncology success in the modern era.

A critically important feature of such sequences is that, despite being highly unique, they are somatic in nature (Greenman et al (2007)). This makes linking to reference genomic data highly unlikely, as such reference data is generally derived from healthy somatic cells, not those subject to genetic disease (Ballouz et al (2019)). The great advantage of this is that tumor sequencing data, with all its clinical and therapeutic significance and, indeed, associated indirect genetic information like biomarker tests, would likely fall on the permissible side of the default risk assessment threshold. However, it is worth noting that further research to support this position and to identify potential edge cases to which it should not apply would be valuable.

Further research

The key to refining genetic disclosure risk as a sophisticated subset of health privacy science is to improve quantification metrics for the types of risk discussed above. This is no easy feat, so let’s highlight a reasonable starting point: the distribution of genetic values within an individual dataset could arguably itself serve as a basis for assessing the likely group sizes of individuals carrying such information. This may be applicable in specific situations where the dataset is not contextually biased towards a particular rare class of mutations for which all individuals in it are being tested to begin with. Whether it succeeds would also depend on how well-defined the distribution itself is: in many practical cases, unique values (which may or may not be uniquely identifying) predominate over repeated ones, which can render this method ineffectual.
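
As a rough illustration of this starting point, the Python sketch below summarizes how the values of a single genetic field are distributed within one dataset. The function name, the summary keys and the variant labels are hypothetical placeholders of our own, and the within-dataset counts are only a proxy: they say nothing on their own about group sizes in the wider population.

```python
from collections import Counter

def value_distribution_summary(values):
    """Summarize how often each value of one genetic field repeats within a dataset."""
    counts = Counter(values)
    n_records = len(values)
    # A singleton is a value that appears on exactly one record.
    singletons = sum(1 for count in counts.values() if count == 1)
    return {
        "records": n_records,
        "distinct_values": len(counts),
        "singleton_values": singletons,
        "singleton_fraction": singletons / n_records if n_records else 0.0,
        "smallest_group": min(counts.values()) if counts else 0,
    }

# Hypothetical labels standing in for recorded variants.
observed = ["variant_A", "variant_A", "variant_A", "variant_B", "variant_C", "variant_D"]
summary = value_distribution_summary(observed)
print(summary["singleton_fraction"])  # 0.5
```

A high singleton fraction is the situation described above in which unique values predominate and the within-dataset distribution offers little basis for estimating group sizes.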

We are, therefore, left with the following challenges:

Can reference data on the frequency of thousands of pathogenic genetic variants for a wide range of genes be systematically collated from available studies?
1. These studies are likely to be academic or clinical papers surveying a subset of variants for a particular gene.
2. It should be considered that in many cases there will be limited sample sizes and pronounced biases arising from the nature of the studies.
3. We should therefore pose an open question to the research community — what are the prospects, now or in the future, for surveying and collecting data on a large scale across many genes, variants and demographic/geographic groupings?

What is an appropriate statistical framework for leveraging the genetic variant frequency data that is available in order to estimate the group size of individuals in the US with a particular variant?
1. This would be the basis for quantifying the risk carried by variants within a dataset, and their resultant permissibility to exist within a de-identified dataset under HIPAA.
2. Consider variants for which no frequency reference data is available but which belong to a gene for which a frequency distribution can be partially constructed for other variants of the same gene. (These are likely to be rare variants and therefore of greatest pertinence when found in data). How should the risk carried by such variants be assessed — is it possible to make some semi-quantitative statement based on the partial distribution constructed for all variants of the gene?
3. Could such a framework be further refined to account for potential correlations between variant incidence and other indirect identifiers like age, gender and geographic location?
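
One very simple candidate for the first step of such a framework is sketched below in Python: scale an observed carrier frequency up to a population, and carry the sampling uncertainty along with it. The choice of a Wilson score interval is our own assumption rather than an established standard in this area, and the function names, the study counts and the population figure are hypothetical. The sketch deliberately ignores study bias and any correlation with demographic or geographic identifiers, both of which are raised in the questions above.

```python
import math

def wilson_interval(carriers, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if sample_size <= 0 or not 0 <= carriers <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= carriers <= sample_size")
    p = carriers / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

def estimated_group_size(carriers, sample_size, population, z=1.96):
    """Scale the observed frequency and its interval up to a population of a given size."""
    low, high = wilson_interval(carriers, sample_size, z)
    return {
        "point": carriers / sample_size * population,
        "low": low * population,
        "high": high * population,
    }

# Hypothetical study: 3 carriers among 10,000 people sampled,
# scaled to a hypothetical population of 1,000,000.
estimate = estimated_group_size(carriers=3, sample_size=10_000, population=1_000_000)
print(estimate)
```

A conservative assessment would arguably compare the lower bound, not the point estimate, against whatever group-size threshold the expert adopts, since small studies leave the true frequency poorly constrained.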

Can non-coding sequences be systematically characterized and assessed for risk based on length, inferred transformations and other relevant qualities?
1. In many practical scenarios, precise details of mutations may be less important than aggregate information like whether an inversion, transversion, deletion etc. has taken place. Such high-level information is likely to carry far less disclosure risk.
2. Such information may be present in real data through a range of direct and indirect formats which may make these interpretations non-trivial to perform.
3. Can mutations be automatically identified and grouped by transformation categories like ‘inversion’ or ‘transversion’ that could reduce risk while preserving utility?
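
For mutations that are already recorded as a pair of reference and alternate alleles, a first pass at the grouping in the final question might look like the Python sketch below. The category names and the function classify_variant are our own, and the rules are simplifying assumptions: insertions and deletions are expected to share a leading anchor base in the style of VCF records, and only an exact reverse complement is called an inversion. As noted above, real data often arrives in less direct formats that would need interpretation before a rule like this could be applied.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Complement each base, then reverse the order."""
    return sequence.translate(_COMPLEMENT)[::-1]

def classify_variant(ref, alt):
    """Map a (reference allele, alternate allele) pair to a coarse category."""
    ref, alt = ref.upper(), alt.upper()
    if set(ref + alt) - set("ACGT"):
        raise ValueError("alleles must contain only A, C, G, T")
    if ref == alt:
        return "no_change"
    if len(ref) == 1 and len(alt) == 1:
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
        return "transition" if same_class else "transversion"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    if len(ref) == len(alt) and alt == reverse_complement(ref):
        return "inversion"
    return "complex"

# Hypothetical allele pairs for illustration.
pairs = [("C", "T"), ("A", "C"), ("A", "AGG"), ("ATG", "A"), ("AAC", "GTT")]
print(Counter(classify_variant(ref, alt) for ref, alt in pairs))
```

Releasing only the category in place of the exact alleles is one way the reduction in disclosure risk described above could be put into practice, at the cost of whatever utility depends on the precise change.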

Final thoughts

As data becomes more complex and niche in its applications, so too must privacy science engage increasingly rigorous statistical methods. In the case of genetic data, its richness, vast potential for utility and increasing abundance are reflected in the sophistication of the approach to risk that it demands.

In this article, we have conveyed the perspectives of experts at the forefront of that challenge. In particular, we have sought to depict the nuances of how genetic health data came to be treated under HIPAA and what that treatment should translate to in practical assessments. The resulting risk philosophy is well-founded, but it still needs a deeper quantitative foundation.

In the future, the success of this endeavor will depend in large part on the availability of reference data describing the distribution of pathogenic genetic signatures across multiple population groups by which risk is traditionally measured. It is therefore paramount that, as the promise of clinical genetic research grows in the next decade of innovation, fully quantified disclosure risk standards are embraced with equal fervor.

Thank you to Jamie Blackport, Jonah Leshin, PhD, Rebecca Slisz, Kyle McLean, PhD, Ben Thackray, PhD, and Adja Touré, PhD, for their feedback, and to James Gow, David Copeland, PhD, Elaine Mitchell, PhD, and Adja Touré, PhD, for their foundational work upon which this article draws.

References

Ballouz S, Dobin A, Gillis JA. Is it time to change the reference genome? Genome Biol. 2019;20:159. doi: 10.1186/s13059-019-1774-4.

Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, Edkins S, O’Meara S, Vastrik I, Schmidt EE, Avis T, Barthorpe S, Bhamra G, Buck G, Choudhury B, Clements J, Cole J, Dicks E, Forbes S, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jenkinson A, Jones D, Menzies A, Mironenko T, Perry J, Raine K, Richardson D, Shepherd R, Small A, Tofts C, Varian J, Webb T, West S, Widaa S, Yates A, Cahill DP, Louis DN, Goldstraw P, Nicholson AG, Brasseur F, Looijenga L, Weber BL, Chiew YE, DeFazio A, Greaves MF, Green AR, Campbell P, Birney E, Easton DF, Chenevix-Trench G, Tan MH, Khoo SK, Teh BT, Yuen ST, Leung SY, Wooster R, Futreal PA, Stratton MR. Patterns of somatic mutation in human cancer genomes. Nature. 2007 Mar 8;446(7132):153-8. doi: 10.1038/nature05610. PMID: 17344846; PMCID: PMC2712719.

Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013 Jan 18;339(6117):321-4. doi: 10.1126/science.1229566. PMID: 23329047.

Johnson AD. Single-nucleotide polymorphism bioinformatics: a comprehensive review of resources. Circ Cardiovasc Genet. 2009 Oct;2(5):530-6. doi: 10.1161/CIRCGENETICS.109.872010. PMID: 20031630; PMCID: PMC2789466.

Smits BM, van Zutphen BF, Plasterk RH, Cuppen E. Genetic variation in coding regions between and within commonly used inbred rat strains. Genome Res. 2004 Jul;14(7):1285-90. doi: 10.1101/gr.2155004. PMID: 15231744; PMCID: PMC442143.