Piecing together a comprehensive and accurate view of the patient journey through health systems – for example, obtaining a complete prescription history or linking clinical trial data with future healthcare encounters – is critical to improving patient outcomes and patient safety. A variety of literature describes the logic and performance of various matching techniques (e.g., Asher et al. 2020; Bernstam et al. 2022; Schnell et al. 2009; Mirel 2021).
While these articles typically speak to whether a matching algorithm is “privacy preserving,” they less often put into context the sources of privacy risk stemming from the various methodologies. Any given approach to patient matching carries its own elements of risk. Exploring these risk factors, and how they can be mitigated, is vital for implementing a process that is both privacy-first and fit-for-purpose.
Privacy risk is inherent to patient matching
Matching a patient’s electronic health records within or across datasets involves the addition of information to a patient’s record. The more information contained in a given record, the easier it is to determine the individual’s identity.
This is particularly true for demographic data as compared to, say, a blood pressure reading or lab test ID. For example, if we know a patient is female and lives in zip code 72842, then that patient must be one of approximately 500 such individuals; however, if we know in addition that the patient was born in 1940, that greatly reduces the number of possible patient identities (see Figure 1). This dynamic is inherent when merging datasets.
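To make this narrowing effect concrete, it can be sketched with synthetic data. The population size and birth-year distribution below are invented for illustration; only the rough arithmetic (each added demographic field shrinks the set of possible identities) is the point:

```python
# Hypothetical illustration: each quasi-identifier an attacker knows
# shrinks the "anonymity set" -- the group of people a record could
# plausibly belong to.
import random

random.seed(0)

# Synthetic stand-in for the ~500 female residents of zip code 72842.
population = [
    {"sex": "F", "zip": "72842", "birth_year": random.randint(1930, 2005)}
    for _ in range(500)
]

def anonymity_set(pop, **known):
    """Count residents consistent with the attacker's known attributes."""
    return sum(all(p[k] == v for k, v in known.items()) for p in pop)

print(anonymity_set(population, sex="F", zip="72842"))                   # 500
print(anonymity_set(population, sex="F", zip="72842", birth_year=1940))  # far fewer
```

Adding one more field (birth year) cuts the candidate pool by roughly the number of distinct birth years in the population, which is why a handful of demographic fields can be uniquely identifying in combination.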
The risk of patient re-identification that is introduced by the sharing of healthcare datasets is governed by the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires that if a covered entity wishes to share patient information, it must either enter into a legal Business Associate Agreement (BAA) with the receiving party that ensures proper data use, or perform operations on the data to minimize the risk of re-identification.
Compliance with HIPAA is a baseline necessity, but should not be a substitute for a holistic approach to counteracting the risks at play in a particular situation. For example, how can we mitigate the consequences of a security breach at the site at which two datasets were brought together? If we were to use identified patient data to match patients, the impact of such a breach would be compounded. It is therefore critical to understand whether there is sufficient gain in utility to justify such practices.
Privacy must be balanced against utility
On one hand, the more information one has about two records, the easier it is to determine whether they represent the same or distinct individuals. At an extreme, having access to a patient’s full history in raw form and assigning a unique global identifier would be optimal from the standpoint of pure matching utility, though highly detrimental to privacy.
On the other hand, working with de-identified patient electronic medical records that convey less information and exist in locked-down environments is preferable from a privacy standpoint, but has limited matching utility.
Fundamentally, patient matching comes down to a determination of similarity between pairs of records, measured discretely or on a continuum. There are a range of fields that a dataset may contain. Each field comes with its own bearing on both privacy and matching utility. These privacy and utility considerations do not lie along a linear continuum. We lay them out in Table 1 as descriptive pros and cons.
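As a toy illustration of similarity measured on a continuum, consider a weighted score over whichever fields a pair of records happens to share. The field names, weights, and prefix-based metric below are all assumptions made for illustration, not any particular vendor's matching logic:

```python
# Hypothetical sketch: pairwise record similarity as a weighted score
# over the fields both records contain.
def field_similarity(a: str, b: str) -> float:
    """Crude normalized similarity: 1.0 on exact match, partial credit
    for a shared prefix (a stand-in for fancier edit-distance metrics)."""
    if a == b:
        return 1.0
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        prefix += 1
    return prefix / max(len(a), len(b), 1)

WEIGHTS = {"last_name": 0.4, "birth_year": 0.4, "zip": 0.2}  # assumed weights

def record_similarity(r1: dict, r2: dict) -> float:
    """Score in [0, 1], renormalized over whichever fields are present."""
    shared = [f for f in WEIGHTS if f in r1 and f in r2]
    if not shared:
        return 0.0
    total = sum(WEIGHTS[f] for f in shared)
    return sum(
        WEIGHTS[f] * field_similarity(str(r1[f]), str(r2[f])) for f in shared
    ) / total

a = {"last_name": "smith", "birth_year": 1940, "zip": "72842"}
b = {"last_name": "smyth", "birth_year": 1940, "zip": "72842"}
print(record_similarity(a, b))  # high but non-exact score
```

A threshold on such a score turns the continuous similarity into a discrete match/non-match decision, which is where the privacy and utility trade-offs of each input field come into play.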
Table 1 includes references to Bloom filters and tokens. For an overview of these data types, see here.
In order to protect patient privacy, one must thoughtfully select a matching methodology and its associated data requirements. Furthermore, the privacy risks outlined in Table 1 are not uniform across all use cases within a data type; rather, the risk conferred by the specific combination of variables within a given dataset should be analyzed by a privacy expert.
Matching best practices
Matching algorithms and data types go hand in hand. The matching logic one uses will require specific input data types, and, conversely, a given dataset is only suitable for certain algorithms. In light of the considerations in Table 1, we outline best practices for an overall matching framework.
- Consult a privacy expert
Matching methodology and matching data requirements should be reviewed by a privacy expert. It is important to understand and be comfortable with the risks associated with any approach. These sources of risk, which can come from the underlying data or the output of the matching process, are rarely obvious; even with individually de-identified datasets, new risk is introduced when the datasets are matched (as illustrated in the original example of Lab and Claims data in Figure 1).
Moreover, an expert review may uncover additional data fields that can be used for matching, boosting matching performance without compromising privacy.
- Work with de-identified data
Matching should take place on de-identified data whenever possible. Even when not required by law, working with HIPAA de-identified data significantly reduces risk, and matching technology has evolved to the point that the gains in accuracy from matching on identified data are minimal and typically not worth the privacy risk trade-off.
- Design patient identifiers with care
A matching algorithm assigns a unique patient identifier to each patient. In order to protect patient privacy, these identifiers should be:
- Encrypted locally at any site that receives the records to mitigate the risk of a data breach at any one site.
- Not readily derivable from the underlying patient data in order to guard against a dictionary attack in which an attacker attempts to join new patient data with the existing patient identifier by generating the identifier from the raw patient data.
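A minimal sketch of these two properties might look like the following. It assumes a keyed HMAC for identifier derivation (so the identifier is not computable from the raw PII without the secret key) and a per-site keyed transform standing in for site-local encryption; all key handling is deliberately simplified for illustration:

```python
# Hypothetical sketch of the two identifier properties above:
#  1. derivation uses a keyed HMAC, so an attacker holding raw patient
#     data but not the secret key cannot regenerate the identifier
#     (guarding against dictionary attacks);
#  2. each receiving site transforms the identifier under its own local
#     key, so a breach at one site does not expose identifiers usable
#     at another.
import hashlib
import hmac
import secrets

MASTER_KEY = secrets.token_bytes(32)  # held only by the matching service

def global_identifier(normalized_pii: str) -> str:
    """Keyed derivation: not computable from PII without MASTER_KEY."""
    return hmac.new(MASTER_KEY, normalized_pii.encode(), hashlib.sha256).hexdigest()

def site_local_identifier(global_id: str, site_key: bytes) -> str:
    """Site-specific keyed transform (a stand-in for local encryption)."""
    return hmac.new(site_key, global_id.encode(), hashlib.sha256).hexdigest()

gid = global_identifier("smith|1940|72842")  # illustrative normalized PII
site_a = site_local_identifier(gid, secrets.token_bytes(32))
site_b = site_local_identifier(gid, secrets.token_bytes(32))
assert site_a != site_b  # same patient, unlinkable identifiers across sites
```

The design choice worth noting is that both properties come from keeping keys out of the data itself: without `MASTER_KEY` a dictionary attack cannot regenerate identifiers, and without a site's local key its identifiers are meaningless elsewhere.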
- Use context-flexible matching logic
Data fields across real world datasets will vary due to the decentralized nature of patient data collection processes across different sources and technologies. A wearable device may collect date of birth, email, and weight, whereas a patient registry may include name and address information.
Accordingly, a matching algorithm should accommodate a wide range of data fields. At the same time, however, a dataset that simultaneously contains multiple hashed representations (such as tokens or Bloom filters) that are collectively derived from every one of these fields would confer significant risk because of the patient-level specificity conveyed by a combination of so many data fields. Therefore, a matching algorithm on de-identified data should accept fields derived from a variety of PII fields even though we would only expect a subset of these fields to underlie any given dataset.
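For intuition about the hashed representations mentioned above, here is a minimal Schnell-style Bloom filter encoding of a single name field: the name is split into bigrams, and each bigram is hashed into a fixed-length bit array with several keyed hash functions. The filter size, hash count, and key handling are illustrative assumptions, not production parameters:

```python
# Minimal sketch of a Bloom filter encoding of one PII field, allowing
# approximate comparison of names without exchanging them in the clear.
import hashlib

M = 128   # filter length in bits (assumed)
K = 4     # hash functions per bigram (assumed)
SECRET = b"shared-secret"  # agreed between data holders, kept from others

def bigrams(s: str):
    s = f"_{s.lower()}_"   # pad so first/last letters form bigrams too
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value: str) -> int:
    """Set K keyed-hash bit positions for each bigram of the value."""
    bits = 0
    for gram in bigrams(value):
        for k in range(K):
            h = hashlib.sha256(SECRET + bytes([k]) + gram.encode()).digest()
            bits |= 1 << (int.from_bytes(h[:4], "big") % M)
    return bits

def dice_similarity(a: int, b: int) -> float:
    """Dice coefficient over set bits: near 1.0 for similar names."""
    inter = bin(a & b).count("1")
    return 2 * inter / (bin(a).count("1") + bin(b).count("1"))

print(dice_similarity(bloom_encode("smith"), bloom_encode("smyth")))
```

Because similar spellings share most of their bigrams, their filters share most of their set bits, which is what lets a matcher tolerate typos while never seeing the raw name. It is also why a dataset carrying filters derived from many PII fields at once conveys far more patient-level specificity than any single filter suggests.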
These principles have implications across a range of matching and matching-adjacent operations such as building in-house matching logic, evaluating third party match algorithms, obtaining the right datasets to match for a given use case, and mitigating privacy risk across all operations.
- Asher J, Resnick D, Brite J, Brackbill R, Cone J. An introduction to probabilistic record linkage with a focus on linkage processing for WTC registries. Int J Environ Res Public Health. 2020;17(18). Available from: http://dx.doi.org/10.3390/ijerph17186937.
- Bernstam EV, Applegate RJ, Yu A, Chaudhari D, Liu T, Coda A, et al. Real-world matching performance of deidentified record-linking tokens. Appl Clin Inform. 2022;13(4):865–73. Available from: http://dx.doi.org/10.1055/a-1910-4154.
- Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Mak. 2009;9(1):41. Available from: http://dx.doi.org/10.1186/1472-6947-9-41.
- Mirel LB. Privacy preserving techniques: Case studies from the data linkage program. Cdc.gov. 2021. Available from: https://www.cdc.gov/nchs/data/bsc/bscpres-Mirel-Data-Linkage-20210519-508.pdf
Editor’s note: This post was updated in December 2022 for accuracy and comprehensiveness.