As part of the tragedy of COVID-19, tens of thousands of new cases are diagnosed in the US each day. In America, the data from these patients are captured across electronic medical records, medical claims, diagnostic tests, pharmacies, and mortality records.
Buried in this data are the answers to many of the questions vexing researchers: how the disease progresses, the impact of various co-morbidities, the safety and efficacy of various therapeutics, the genetic correlations of the disease, and the demographic disparities of the disease’s impact. Tens of thousands of researchers are hard at work trying to understand snapshots of the data to improve our understanding.
But the data is fraught with challenges: it is siloed across institutions, coded differently in different contexts, often incomplete, and not randomized. Data must be fit for purpose to answer the questions we face as a society. And patient privacy must be protected as researchers answer these questions.
This post is intended as a guide to the real-world data available within America’s health data infrastructure for the tens of thousands of researchers seeking to better understand COVID-19.
Introduction: Background on Real-World Data
While the COVID-19 pandemic has resulted in enormous suffering and cost, it also has been a catalyst for changes that healthcare industry veterans, innovators, and patients have spent decades advocating for, and which are now happening in a matter of months.
Perhaps the most significant of these is the use of so-called “real-world data” (or “RWD”) — that is, any health data that is collected in the ordinary course of care rather than in the context of a controlled clinical trial. Researchers can use RWD to better understand disease burden, the effectiveness of drug and non-drug treatments, and the impact of both government and social support programs on disease outcomes.
Used correctly, real-world data can answer questions like:
- Which drugs are being prescribed off-label to COVID-19 patients, and which ones correlate with reduced hospitalization, ventilator use, and/or mortality?
- How do other health conditions impact the likelihood of mortality from COVID-19?
- How does COVID-19 status impact the progression of other diseases?
- Which pre-existing prescription drug use is correlated with protection against hospitalization, ventilator use, and mortality?
- How do socioeconomic factors (race, gender, urban vs. rural, income level, etc.) impact risk of hospitalization and mortality?
- Does the presence of a caregiver in the home make a person more or less likely to be tested, be hospitalized, and/or have a bad outcome?
- What subset of patients is most likely to end up in the ICU?
- Do veterans who received community care versus VA care differ in risk of mortality?
- Does living in public housing put patients at higher risk of being infected with COVID-19 and/or dying?
- Which pre-existing prescription drug use is correlated with protection against contracting COVID-19?
Real-world data has enormous advantages, including scale (each year, across the U.S., people have billions of interactions with the healthcare system), timeliness (much of this data can be made available within weeks or days), and cost (compared to large randomized controlled trials that may cost north of $100 million to run, RWD studies are inexpensive to conduct). Of course, such data also has limitations — it is fragmented across the healthcare system, can be messy and incomplete, and studies need to be thoughtfully designed to prevent unjustified inferences.
Real-world data is a vast and still mostly untapped resource in U.S. healthcare, but we will have to rapidly improve our ability to make effective use of it. That means understanding what combinations of real-world data are “fit for purpose” to answer different research questions, and what data can be used to validate findings through parallel studies.
And importantly: it is critical that information collected as real-world data is collected ethically, that data is adequately de-identified, and that patient privacy is protected throughout the process.
We’re providing here a guide to (1) the value and limitations of major real-world data types and (2) the necessity of linking multiple RWD data sets at the patient level to create true “fit-for-purpose” data sets, filling gaps and creating more valuable data sets for a variety of pressing research questions.
Part 1: Major Real-World Data Types
As we’ve written about in the past, data in the US is distributed across a complex ecosystem of thousands of institutions, captured in this diagram:
Below is an overview of the major types of data that can be used for COVID-19 related studies, and the strengths and weaknesses of each data type for specific analyses.
Mortality data, despite being one of the most important endpoints in healthcare analysis, is surprisingly hard to come by in traditional healthcare data sets. Unless a patient dies during care (in a hospital or long-term care facility), the event is not recorded in EHRs or other standard data flows. Mortality data may include the date of death, cause of death, and other demographic and geographic information on the deceased.
Fit-For-Purpose Assessment: Given its absence from many traditional health data sets, mortality data must be compiled from a variety of sources, including government databases (death certificates are available from states, and both CDC and the Social Security Administration maintain death indices), obituary data, and/or life insurance data.
- CDC’s National Death Index (NDI) is the premier data set in both coverage of deaths in the United States, and in depth due to its inclusion of the cause of death. Cause of death is important in precisely measuring mortality related to a specific condition — without it, mortality due to accidents and other causes could be attributed to use of a therapeutic or medical procedure, or to disease progression. Unfortunately, the NDI has only been available to academic researchers to date.
- Social Security Administration’s Limited Access Death Master File (LADMF) is built from deaths reported by states for the purpose of making accurate social security payments, but many states have opted out of such reporting over the last decade. While the LADMF records mortality events going back more than fifty years, its coverage in recent years has fallen to just 15–20% of all U.S. deaths. Further, access to the weekly-updated LADMF data set is limited to select approved use cases.
- Obituary data is publicly available through website and newspaper postings, and can be obtained by scraping internet sites or licensing it directly from obituary-posting service providers. Aggregated obituary data offer coverage of >70% of mortality events in a timely fashion, but do not contain a cause of death and likely under-represent the poor.
- Life insurance providers record mortality events when a claim is filed, and often a cause of death is recorded in their data. However, the life insurance industry is highly fragmented, making it difficult to assemble a comprehensive set of data. This source of mortality highly under-represents the poor and the young.
Electronic health record (EHR) data are collected in the ordinary course of hospital and ambulatory care visits. The information is entered directly by the physician or nurse, and that information is then supplemented with lab, imaging, radiology, and genetic testing results as it is sent to the treating physician for record keeping.
Fit-For-Purpose Assessment: De-identified EHR (also called electronic medical record, or EMR) data is well purposed to understand the entire patient experience at a single provider, and covers symptoms, diagnostic testing, and medical treatments and procedures. EHR data is best used to understand how a physician arrived at a diagnosis and treatment decision, as the patient record should contain the information that was available to the physician in making their assessment.
However, numerous vendors provide EHR software, so each provider facility may use a different system. This fragmentation means that as a patient is referred from a PCP to a specialist, or from an outpatient setting to in-patient or long-term care, that patient’s data will often be recorded in different systems. EHR data alone are therefore insufficient to understand the patient journey over a long period of time, whether assessing the patient’s path across providers to arrive at a final diagnosis, or trying to understand the long-term outcomes of a treatment decision.
Additionally, while EHR data will commonly contain drug prescription information, it is important to remember that this data only reflects what the physician has prescribed, and not what the patient has actually filled at a pharmacy. For that information you need to use pharmacy claims data (discussed below).
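The gap between what was prescribed and what was filled can be made concrete with a small sketch. The Python below is illustrative only: the records, the field names (`patient_token`, `drug`, `prescribed_on`, `filled_on`), and the 30-day fill window are assumptions made for this example, not fields from any particular EHR or pharmacy claims feed.

```python
from datetime import date

# Hypothetical, already de-identified records; field names are illustrative.
ehr_prescriptions = [
    {"patient_token": "A1", "drug": "atorvastatin", "prescribed_on": date(2020, 3, 2)},
    {"patient_token": "A1", "drug": "metformin", "prescribed_on": date(2020, 3, 2)},
    {"patient_token": "B2", "drug": "lisinopril", "prescribed_on": date(2020, 4, 10)},
]
pharmacy_claims = [
    {"patient_token": "A1", "drug": "atorvastatin", "filled_on": date(2020, 3, 4)},
    {"patient_token": "B2", "drug": "lisinopril", "filled_on": date(2020, 6, 30)},
]


def was_filled(rx, claims, window_days=30):
    """Return True if a matching pharmacy claim exists within the window.

    The 30-day window is an assumption for this sketch, not a standard.
    """
    for claim in claims:
        same_rx = (
            claim["patient_token"] == rx["patient_token"]
            and claim["drug"] == rx["drug"]
        )
        if not same_rx:
            continue
        delay = (claim["filled_on"] - rx["prescribed_on"]).days
        if 0 <= delay <= window_days:
            return True
    return False


# Prescriptions that appear in the EHR but have no timely fill in the claims.
unfilled = [rx for rx in ehr_prescriptions if not was_filled(rx, pharmacy_claims)]

for rx in unfilled:
    print(rx["patient_token"], rx["drug"], "prescribed but not filled within 30 days")
```

An analysis that relied on the EHR alone would count all three prescriptions as treatment; pairing it with pharmacy claims shows that only one was filled within the window.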
To best understand which de-identified EHR sources are fit for your analytics, it is important to know the different types of EHR systems used by providers:
- Solo and small physician practices: To claim the federal incentive payments for installing EHR systems (offered under the HITECH Act of 2009), many small practices chose free EHR software from vendors like Practice Fusion. These data can often be accessed directly from these vendors, with good standardization across the data set since little to no software customization is provided.
- Medium-to-large ambulatory physician practices: Larger practices are able to afford more advanced EHR software, but often opt not to customize the installation, leading to strong standardization across providers as well. These data can likewise be accessed directly from an EHR vendor (e.g., Allscripts), but may also be accessible through provider quality benchmarking and cost-optimization services (e.g., Healthjump or Health Catalyst) that process the EHR data of these providers to measure operational efficiency.
- In-patient practices and hospitals: Academic medical centers and large health systems serve a diverse and complicated set of patients, and therefore install highly customized EHR systems. While there are some suppliers (e.g., Epic) who specialize in these installations, these vendors often do not have rights to the data. Therefore, each provider must make their data available to researchers. Some networks like PCORnet have been created to make data accessible across this sector for approved research efforts.
- Long term care: Skilled nursing facilities and other long-term acute care facilities care for an especially vulnerable patient population, and are served by specialized EHR vendors like PointClickCare.
- Government care: Active duty military personnel and veterans may receive care at VA clinics or in the community setting. The VA record set is therefore an important data set for studying this population, as well as for studying diseases that are prevalent among veterans like PTSD and traumatic brain injury.
- Specialists: Some specialties have specific data recording requirements, and as such, there are EHR systems that have been customized to serve these sectors. These EHR systems frequently capture data necessary for understanding diseases tied to those specialties, such as tumor measurements for oncology patients. For example, Flatiron provides a system specifically designed for oncology practices, Nextech is optimized for ophthalmology, and TherapyNotes is focused on mental health practices. It is also important to know that some of these systems can be expensive, and therefore may not be used in solo and small specialty practices.
Medical claims data are created every time a patient receives a medical service that is billed to their insurer. An insurance claim is sent from the provider to the payer in a standardized format (the 837 EDI format) and includes information about the visit to be paid for (patient demographics, diagnosis and procedure codes, dates of service, etc.) and information required for payment, such as the doctor, treating facility, and patient’s insurance provider.
Fit-For-Purpose Assessment: Medical claims are captured at a number of points: by software used at the provider to generate the claim, at a claims clearinghouse that routes those claims to the proper payer, or at the payer itself. Because of their standard format and use across all specialties and provider types, claims are an excellent way to get large sample sizes for disease epidemiology and comorbidities, understanding the provider landscape, and following a patient’s journey across the health landscape. A medical claim can also contain drug treatment information for drugs that are administered by the doctor (e.g., infusions of drugs like Remicade), but will not contain retail pharmacy prescriptions.
However, the medical claim format does not include many important data fields necessary to understand why a diagnosis or treatment decision was made because it does not contain any diagnostic testing information, physician notes, or other details. For that, de-identified medical claims need to be linked with de-identified EHR data. Additionally, a medical claim is only filed for a reimbursed service, and therefore this data type is not useful for analyzing any service for which the patient pays out-of-pocket (e.g., over-the-counter drugs).
There are multiple sources of medical claims data, and each has specific strengths and weaknesses:
- Revenue cycle management (RCM): RCM software is used at provider facilities to help practices submit claims to insurers, bill patients, and track payments from both. There are numerous vendors offering RCM solutions, including companies such as Ability Networks, Waystar, and Office Ally. Claims gathered from these sources are timely, but geographic coverage can vary depending on the practices using the service. For example, Office Ally has national coverage, but is especially strong on the West Coast. Firms providing operations analytics to providers may also have data use rights to the medical claims sent to them as part of these analyses, and can be an access point as well.
- Medical claims clearinghouses: Once a claim is generated, it is routed by special software to the correct payer for payment. This routing service is called a claims clearinghouse or a “switch” (note that some RCM services are integrated with claims clearinghouse services). There are only a few major claims clearinghouses in the United States, meaning that accessing data from any one of them gives a researcher data about an enormous portion of the population. Data is typically available in just 1–2 days post submission, making it one of the most timely data sources for measuring rapidly developing events like the spread of the COVID-19 pandemic.
- Private payers: Payers are the final destination for a medical claim, and a great source for seeing all of the medical activity for a patient regardless of the switch the claim was processed through. For a longitudinal analysis of a patient across their entire journey, payer-based claims are often the optimal source to use. However, because Americans change insurance every 2–3 years on average, the longitudinal analysis that can be done on a single payer data set is often limited. In addition, payers can be slow to process their claims, and some claims are still filed outside of the digitized process described above. These issues mean that payer-based claims can have a lag time of up to 3 months and are sub-optimal for measuring rapidly developing public health situations. Private payer data can be difficult to access as most payers do not make it commercially available, but there are several large data sets available through Optum (based on UnitedHealthcare’s subscriber base) and IBM (through their MarketScan data, gathered from risk-bearing employers). When evaluating payer data sets it is also important to note that some payers receive claims directly from providers (not through a clearinghouse), the most important of which are the “Blues” (Blue Shield and Blue Cross plans).
- Federal government payers: More than a third of Americans receive their insurance coverage through the government (Medicare for the elderly and disabled, Medicaid for low-income households, and TRICARE for the military). Some Medicare data is available through the Medicare Advantage programs administered by private insurers (see above). Medicare data can be accessed through CMS’s qualified entity (QE) program with some important restrictions.
- State government payers: Medicaid services are often administered by private insurers in each state, and therefore these data may sometimes be found in claims accessed from private payers. There are also some RCM vendors with a large installation base in Medicaid providers, such as Ability Networks. There are also special programs like 340B that provide payments for disadvantaged patients. 340B data gives insight into the diagnosis and treatment of underserved patients typically receiving care at free health clinics. This data is particularly useful for understanding access to care for disadvantaged populations.
Pharmacy claims data are generated by pharmacies in order to be paid for the cost of a prescribed drug or medical supply that is dispensed to a patient (the patient pays the copay, and the remainder is billed to the payer).
Fit-For-Purpose Assessment: A pharmacy claim will include dosage information, drug strength, fill dates, financial information, and de-identified patient and prescriber codes and is the best source for understanding prescription patterns by providers. For this reason, this type of data is the backbone of incentive compensation calculations for pharmaceutical sales reps, and is closely tracked by brand teams (and financial services) to understand market share, competitive dynamics, and changes in prescribing behavior due to new drug launches, generic drug entry, and marketing campaigns. Pharmacy claims are also generated for mail order prescriptions, which are particularly important when studying chronic conditions.
Pharmacy claims by their nature do not capture transactions that are not billed to insurers, but are instead paid for out-of-pocket by the patient (e.g., over-the-counter medications). Likewise, pharmacy claims may miss capturing specialty drug treatments where the drug is injected or administered in a medical setting. Such drugs are covered as medical benefits and billed as medical claims.
Like a medical claim, a pharmacy claim can be captured at the point of generation (the pharmacy), at a claims clearinghouse, or at a payer. Each source has different advantages and disadvantages:
- Retail pharmacy chains: Retail pharmacies like CVS and Walgreens generate a multitude of pharmacy claims for the drugs they dispense (liquids, pills, etc.), but most have entered into contracts with pharmacy claims aggregators and this data is hard to come by elsewhere. That being said, some of the same information may be gathered by operations software installed at the pharmacy, such as McKesson PTS’s pharmacy services.
- Specialty pharmacies: Many of the new drugs launched in the United States are specialty products, which are injected or infused. These products require additional patient support, and are therefore dispensed by specialty pharmacies who can provide those services. However, because pharmaceutical companies contract with a subset of specialty pharmacies to dispense each brand, those companies often own the data rights to this dispensing information. To access it often requires having the pharmaceutical company contract with the specialty pharmacy to deliver the data.
- Hospital pharmacies: Hospital pharmacies dispense drugs for inpatient treatment, but these data are hard to access as hospitals do not tend to grant data use rights to their vendors (similar to their stance on their EHR data).
- Pharmacy claims clearinghouses: Like medical claims, pharmacy claims are electronically routed to the proper payer for payment. Much of this data is also exclusively licensed by pharmacy aggregators and not available except through those sources.
- Payers: Similar to medical claims (discussed above), payers are great sources for seeing all of the prescriptions a patient has filled, but there is a substantial time lag and patients are again lost once they change insurers (every 2–3 years on average).
- Pharmacy benefit managers (PBMs): PBMs act on behalf of payers to determine which drugs are covered and to administer pharmacy coverage, and may have access to pharmacy claims to measure the impact of their programs. However, it is rare that PBMs make this data available without the payer’s permission.
Claims remittance data are generated when a payer reviews a medical claim and determines how much they will pay the provider (a process called “adjudication”). Claims remittances are sent from the payer back to the provider using a common format (the 835 EDI).
Fit-For-Purpose Assessment: A claim remittance has important cost information for the services provided to a patient, and is a key data source for cost effectiveness and pharmacoeconomics studies. However, the claim remittance does not include the clinical information, and therefore must be paired with a medical claim (using a unique claim ID) to understand what the payment is for.
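As a rough illustration of that pairing, the sketch below joins remittance records to medical claims on a shared claim ID. It is a simplified picture under stated assumptions: the dictionaries stand in for already-parsed 837 and 835 records, and the field names (`claim_id`, `paid`, `diagnosis`, and so on) and dollar amounts are hypothetical rather than actual EDI segment names or real payments.

```python
# Hypothetical parsed medical claims (837), keyed by claim ID.
medical_claims = {
    "CLM-001": {"patient_token": "A1", "diagnosis": "U07.1", "procedure": "99223", "billed": 1200},
    "CLM-002": {"patient_token": "B2", "diagnosis": "J12.89", "procedure": "71046", "billed": 300},
}

# Hypothetical parsed remittances (835): cost data only, no clinical detail.
remittances = [
    {"claim_id": "CLM-001", "paid": 850},
    {"claim_id": "CLM-002", "paid": 210},
    {"claim_id": "CLM-003", "paid": 90},  # no matching medical claim in this data set
]

paired = []      # remittances enriched with clinical context
unmatched = []   # claim IDs we cannot interpret without the medical claim

for remit in remittances:
    claim = medical_claims.get(remit["claim_id"])
    if claim is None:
        unmatched.append(remit["claim_id"])
        continue
    paired.append({**claim, **remit})

# Total paid amount per diagnosis code, which neither source can provide alone.
paid_by_diagnosis = {}
for row in paired:
    code = row["diagnosis"]
    paid_by_diagnosis[code] = paid_by_diagnosis.get(code, 0) + row["paid"]

print(paid_by_diagnosis)
print("Unmatched remittances:", unmatched)
```

Tracking the unmatched remittances matters in practice, since a large unmatched share suggests that the claims and remittance sources cover different providers or time periods.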
Claims remittance data may be captured through the same set of participants discussed above for medical claims, with generally the same advantages and disadvantages except that remittance data can have a long time lag given the long adjudication process at some payers.
Chargemaster and group purchasing data is billing data available from the financial systems of providers and group purchasing organizations (GPOs), the entities that help groups of providers realize savings by aggregating purchase volume and negotiating discounts with vendors. Chargemaster data is at the patient level, and includes information on the charges they incur during inpatient care. GPO purchasing data is often at the facility level, and includes the bulk supplies they have ordered.
Fit-For-Purpose Assessment: Chargemaster and GPO data are good sources of information for the assessment of the overall cost of goods and supplies for cost effectiveness studies. However, critical details such as individual drugs are often not line-itemed in the chargemaster data and so detailed analytics such as pharmacoeconomics are not possible using this data.
Diagnostic lab testing data are collected at the testing facility. Most diagnostic lab tests are sent out to third-party lab testing services (such as Quest or LabCorp), though for inpatient care the hospital may perform the lab tests themselves on premise.
Fit-For-Purpose Assessment: While the ordering of a lab test is recorded in EHR and claims data sets, the results are often missing from EHRs and are never included in claims. Lab testing data is vital for understanding the severity of a disease (e.g., level of cholesterol). Such data is also critical to distinguish between suspected disease and actual disease. Just because a physician diagnoses someone with a condition (and uses the corresponding ICD code) does not mean that the physician’s diagnosis will be confirmed by the subsequent lab test.
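One way to picture the distinction between suspected and confirmed disease is to check each diagnosed patient against lab results. The sketch below is a minimal example with made-up data: the lab sources, local test codes, result strings, and the `TEST_CODE_MAP` lookup are all hypothetical, and a real analysis would also need to consider test timing and test type.

```python
# Tokens of patients carrying a COVID-19 diagnosis code in claims (hypothetical).
diagnosed_tokens = {"A1", "B2", "C3"}

# Hypothetical lab results from two labs that use different local test codes.
lab_results = [
    {"patient_token": "A1", "source": "lab_x", "test_code": "COV2PCR", "result": "POS"},
    {"patient_token": "B2", "source": "lab_y", "test_code": "8841", "result": "Not Detected"},
    {"patient_token": "B2", "source": "lab_y", "test_code": "1200", "result": "Detected"},  # unrelated test
]

# Assumed mapping from (lab, local code) to a common test name.
TEST_CODE_MAP = {
    ("lab_x", "COV2PCR"): "sars_cov_2_pcr",
    ("lab_y", "8841"): "sars_cov_2_pcr",
}
POSITIVE_VALUES = {"pos", "positive", "detected"}


def classify(token, target_test="sars_cov_2_pcr"):
    """Label a diagnosed patient by what the lab data can support."""
    relevant = [
        r for r in lab_results
        if r["patient_token"] == token
        and TEST_CODE_MAP.get((r["source"], r["test_code"])) == target_test
    ]
    if any(r["result"].strip().lower() in POSITIVE_VALUES for r in relevant):
        return "confirmed"
    if relevant:
        return "not confirmed"
    return "no lab result"


status = {token: classify(token) for token in sorted(diagnosed_tokens)}
print(status)
```

The third category is important: a patient with no linked lab result is not the same as a patient whose test came back negative.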
While enormously valuable, lab data can be challenging to work with because test codes are not standardized across different lab data sources, and lab data is fragmented across a large number of small labs. Further, nearly half of lab testing is run at hospital-affiliated labs and individual physician’s offices, leading to major data gaps.
Genetic testing data are becoming more important as medicine becomes more personalized. Genetic testing labs specialize in identifying specific genetic variants that are known to be associated with diseases of interest.
Fit-For-Purpose Assessment: Genetic testing data is a vital element in most oncology analytics, as the results offer vital information about the cancer type and the treatment that is most likely to be effective. Any study that is looking at patient segmentation, treatment choice, and outcomes needs to include the genetic biomarkers of the cancer in their analysis. As more drugs come to market for niche populations, especially in rare and orphan indications, genetic test results will grow in importance.
Genetic testing data is spread across a large number of genetic testing companies, including companies like Foundation Medicine, Invitae, Myriad Genetics, and Qiagen. Each company in this space has developed its own genetic testing panel composed of a different mix of genetic variants that are assessed. This lack of standardization makes it difficult to aggregate results to inform analyses of the overall patient population, and the ideal fit-for-purpose testing source is likely to differ depending on the disease of interest. Further, the sensitive nature of the data forces many of these organizations to be extremely careful about when and to whom they make the data available and requires researchers to take additional measures to protect patient privacy.
Genomic (DNA) sequencing data are collected by specialized labs (e.g., Helix), as well as by consumer vendors (e.g., 23andMe and Ancestry.com) to educate interested customers on their ancestry and predisposition to various conditions.
Fit-For-Purpose Assessment: Genomic sequences are very important for uncovering potential genetic causes for undiagnosed diseases, and for identifying broader and/or novel biomarker associations with diseases of interest. For these purposes, these data are only effective when paired with a comprehensive medical history for the individual and their family members that offers the corresponding phenotypic data. Further, working with genomic sequencing data requires a high degree of specialization (more so than other real world data types), and poses privacy challenges given the highly detailed information available in such data.
Patient registries are disease-specific collections of data of exceptional depth. Most registries are supported by medical or academic societies to support in-depth research of their patients, and are built from painstaking chart reviews, patient surveys, and collation of other sources.
Fit-For-Purpose Assessment: Patient registries are often the deepest single data set available to assess a disease state, with a strong collection of clinical data and medical history, often supplemented with patient-reported behaviors and outcomes. However, due to the time and expense of collecting this data, registries often resemble clinical trials with smaller sample sizes, and substantial time lags due to intermittent collection periods. And, like clinical trial data, they are tightly controlled by their owners.
Social determinants of health (SDOH) data may be captured to a limited degree in traditional health data sets, but more detailed information around race, behavior, and socioeconomic status is often more systematically collected in non-traditional datasets.
Fit-For-Purpose Assessment: Today, we are seeing wide variances in health outcomes by sex, race, and along community lines that can only be partially understood by analyzing traditional health data — demographic data is key to understanding these disparities. Renewed focus on SDOH has given rise to new health data companies like Socially Determined that are focused on providing useful SDOH data and analytics for healthcare companies. Much of this data is collected from non-traditional sources that have more often been used in consumer and marketing settings than in health settings.
Demographic data only becomes valuable when linked with traditional health data, and such links have to be done carefully and thoughtfully to ensure that patient privacy is protected. Given the amount of non-healthcare data available and the inability to include it all from a privacy perspective, it is important to be selective about which data type is best suited for each use case:
- Government (local, state, and federal) agencies provide critical support programs for vulnerable populations, and these data can be used to understand critical social determinants of health such as food and housing security. Data from agencies like Housing and Urban Development, from programs like SNAP (food stamps), and similar data can be hard to access outside of research settings.
- Loyalty card programs from grocery stores and other retailers collect purchasing behavior from large groups of Americans, and can give insight into the amount of alcohol or tobacco use, diet, and use of over-the-counter medications.
- Financial demographics data is collected from financial institutions and credit ratings bureaus, and can be used to segment patients by their ability to pay and to understand the correlation of income to healthcare access and outcomes.
- Household demographics data is collected from a variety of sources, and includes important information about the living environment of a patient such as whether they live alone or with a caregiver, in a high crime area, have high exposure to allergens or toxins, etc.
- Patient surveys are regularly used to collect information about their attitudes, behaviors, and needs. These include custom designed data collection instruments, as well as standardized questionnaires like Kantar’s National Health and Wellness Survey.
- (And many more…)
Devices, sensors, and wearables data is becoming more available every year through connected medical and consumer devices. These devices capture real-time data about vital signs and other important markers of health status.
Fit-For-Purpose Assessment: Connected consumer devices are becoming increasingly prevalent, and can provide rapid updates of health status. For example, Kinsa’s smart thermometer data can be used to track the outbreak of an infectious disease (as it was during the early days of the pandemic), and Fitbit’s data can be used to correlate activity with health outcomes. However, these devices can have irregular usage or usage correlated with health outcomes (for example, use of a smart thermometer only when symptomatic). These devices may also capture data for multiple individuals in the same household (e.g., smart scales), making it difficult to correctly connect to other data sets.
Connected medical devices like glucose monitors, ECG monitors, pacemakers, and the like can stream continuous data to the manufacturers, patients, and physicians. This data can be used to predict adverse events and outcomes, but is not often shared beyond the manufacturer.
Clinical trial data are collected by academic investigators and by sponsors in very controlled settings.
Fit-For-Purpose Assessment: Clinical trial data sets provide deep insight into specific drug and device efficacy and safety, and are the only way for confirmatory studies to take place, thanks to the benefits of randomization and controlled trial environments. However, these data sets have no visibility outside of the trial setting, and are therefore poor for correlating the patient experience during the trial to the rest of their patient journey; additionally, trials tend to be slow and expensive. While nothing matches the ability of clinical trials to prove an outcome, other data sets can augment them. In particular, clinical trial data can be linked to real-world data to receive the benefits of both randomization and breadth of data.
Part 2: Linking Real-World Data Sets to Answer Key Questions
While each real world data type is valuable on a stand-alone basis, each data type also has disadvantages and limitations. Additionally, within each data type, there are often several different potential sources of data, each of which has additional gaps in coverage and may under-represent specific geographic or demographic groups of patients. To create a data set that is even more fit-for-purpose, therefore, multiple data sets can be combined at the patient level to fill critical gaps in coverage and to add missing variables.
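To make patient-level linking concrete, here is a deliberately simplified sketch of the general idea: each data holder replaces identifying fields with the same keyed hash, so records can be joined on the resulting token without names or birth dates ever appearing in the combined data set. Everything in it is an assumption for illustration, including the choice of HMAC-SHA256, the normalization rules, the field names, and the demo key. It is not a description of any vendor's tokenization method, and it is not sufficient on its own to meet de-identification requirements.

```python
import hashlib
import hmac

# Assumed shared secret for the demo; real systems manage keys very differently.
SECRET_KEY = b"demo-key-not-for-production"
IDENTIFYING_FIELDS = ("first", "last", "dob")


def make_token(first, last, dob):
    """Derive a stable token from normalized identifiers (illustrative rules)."""
    normalized = f"{first.strip().lower()}|{last.strip().lower()}|{dob}"
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records):
    """Replace identifying fields with a token, keeping the remaining fields."""
    output = []
    for record in records:
        rest = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        token = make_token(record["first"], record["last"], record["dob"])
        output.append({"token": token, **rest})
    return output


# Two hypothetical sources describing overlapping (fictional) patients.
claims_raw = [
    {"first": "Ana", "last": "Diaz", "dob": "1950-02-01", "diagnosis": "U07.1"},
    {"first": "Lee", "last": "Wong", "dob": "1962-11-15", "diagnosis": "U07.1"},
]
mortality_raw = [
    {"first": " ana", "last": "DIAZ", "dob": "1950-02-01", "date_of_death": "2020-05-03"},
]

claims = deidentify(claims_raw)
mortality = deidentify(mortality_raw)

# Join on the token: claims supply the diagnosis, mortality fills the outcome gap.
death_by_token = {m["token"]: m["date_of_death"] for m in mortality}
linked = [{**c, "date_of_death": death_by_token.get(c["token"])} for c in claims]

for row in linked:
    print(row["token"][:8], row["diagnosis"], row["date_of_death"])
```

Note that the match succeeds despite differences in capitalization and spacing only because both sources apply the same normalization before hashing; inconsistent normalization is a common reason that exact-match linking misses true matches.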
Below are some examples of how different combinations of real-world data can be used to respond to some of the most pressing questions that people are trying to answer in relation to COVID-19.
Real-world data is a tremendously valuable resource for everyone who wants to better understand and improve health data, but the effective use of real-world data is not without its challenges. As the country moves to reopen while infections continue to spread, now is the time to navigate those challenges to unlock the full value of real-world data for healthcare.
Travis May is the Chief Executive Officer of Datavant, a company whose mission is to connect the world’s health data to improve patient outcomes.
Jason LaBonte, PhD is the Chief Strategy Officer of Datavant.
Thank you to Vera Mucaj, PhD, Su Huang, and Sarah Rowe for their support in drafting this article.
Together with Health Care Cost Institute, Change Healthcare, Veradigm, Symphony Health, Healthjump, Office Ally, Medidata, and Snowflake, Datavant is a founding member of the COVID-19 Research Database, whose goal is to make fit-for-purpose real world data available pro bono for noncommercial COVID-related research.
How America’s Health Data Infrastructure is Being Used to Fight COVID-19 was originally published on Medium.
Editor’s note: This post has been updated on October 19, 2022 for accuracy and comprehensiveness.