Major Breakthroughs in Biomedical Research Made Possible by Real-World Data

Karin Eisinger
Kathleen Gavin
Vera Mucaj
May 13, 2024

Translational research bridges the gap between basic scientific discoveries and clinical applications, playing a crucial role in understanding and enhancing patient health outcomes. In recent years, the integration of Real-World Data (RWD) - such as electronic health records (EHRs), medical claims, and other patient information not related to clinical trials - has become instrumental in explaining scientific findings and driving clinical advancements. Advances in data linkage technology now allow for the seamless integration of diverse RWD sources, providing comprehensive insights into diagnoses, healthcare usage, comorbidities, prescriptions, family medical history, and past health issues1. These data offer a tremendous opportunity to establish unique fingerprints for individual patients and to identify commonalities among patients that point to causal relationships underlying disease incidence or patient outcomes. Below we discuss exciting recent developments in RWD-enabled discovery.

A connection between Epstein-Barr virus infection and multiple sclerosis

A compelling example comes from Bjornevik et al2. The study evaluated a cohort of more than 10 million military service personnel, using serum samples from the Department of Defense Serum Repository (DoDSR) together with the longitudinal records kept on those personnel. After reviewing 20 years of data collected at sites all over the country, the researchers reported that infection with Epstein-Barr virus (EBV) increased a patient’s risk of developing multiple sclerosis (MS) by ~32 fold, approximately the same level of risk that links smoking with lung cancer. Virus-mediated MS development has long been a topic of speculation and investigation3, along with other theories on the role of vitamin D deficiency, obesity, and smoking. Laboratory work over several decades, and from multiple groups, had led to the idea that EBV-infected B cells could cause an autoimmune reaction that damages the central nervous system (CNS). Following the publication of the DoDSR-supported findings, Lanz et al3. published a possible underlying molecular process by which EBV-infected B cells can promote MS development. These findings can now be used to develop novel therapeutic interventions for MS, including EBV vaccines, B cell depletion, and others. The combined power of molecular and RWD insights has profoundly shifted our understanding of CNS disorders and highlights the need to investigate the long-term effects of viral infections, both common viruses and new ones such as SARS-CoV-2, which causes COVID-19.

This research was uniquely positioned to facilitate the EBV-MS discovery because it used data generated from both patient blood samples and their longitudinal medical history. With buy-in from academic institutions and federal support for similar datasets, the same approach of linking disparate, siloed RWD with patient information could be applied outside the military framework to the general public, allowing the same progress in multiple disease contexts.

A new understanding of Alzheimer’s disease

The root causes of Alzheimer’s disease (AD) are still unclear. Some genetic mutations are linked to AD, but the lack of consistency means we still don’t understand the full picture of AD and its origins. We do know that AD is associated with aggregation of amyloid-beta (Aβ) proteins in the brain and blood vessels, as well as with tangles of hyperphosphorylated tau protein. A collection of recent studies has shed new light on one potential origin of these protein aggregations and suggests that in some rare cases AD is actually transmissible through small infectious proteins called prions, similar to those that cause the transmissible variant of Creutzfeldt–Jakob disease (CJD) associated with bovine spongiform encephalopathy, or “mad cow” disease4. This finding adds a novel etiology hypothesis for AD development that had not been considered before.

The source of the amyloid proteins in these cases of AD is iatrogenic, meaning it was introduced through medical treatment. Each patient in these analyses received cadaver-purified human growth hormone, which was often prescribed to children of short stature to promote growth5,6. The hormone was isolated from cadaver pituitary glands using a specific method that allowed prions to be passed from cadaver to patient, which many years later resulted in AD spectrum disorders. Historical information about each patient’s growth hormone injections was contained in their medical records. In the UK, where these studies were performed, patient records are maintained in a national database linked to the National Prion Clinic (NPC), which has been following individuals treated with human growth hormone. Connecting patient medical records with diagnostic centers such as the NPC has led to tremendous advancement in our understanding of transmissible neurologic disorders, including variant CJD and now AD.

There are two critical questions emerging from these RWD. 1) How will we track prion-related neurological diseases, particularly in the US, which lacks well-established national healthcare registries? 2) Now that MS and AD have been linked to EBV and prions respectively, which other neurological or general disorders are caused by environmental exposures or transmissions? A silver lining in this finding: the chance of anyone getting AD via this type of transmission today is incredibly low, since the procedure referenced in these studies has long been discontinued.

The long-term effects of COVID-19 on cardiovascular health

Recently, researchers uncovered another major RWD connection using the VA patient index when studying COVID-19 infections. The incidence of cardiovascular disorders such as stroke and heart attack was significantly higher in people who had recovered from infection with SARS-CoV-2, the virus that causes COVID-19, than in similar but uninfected individuals. The risk of heart attack was increased by 72% and the risk of stroke by 52% in infected people for up to a year post infection7. The most substantial increase in risk was associated with severe infection, suggesting that vaccination to blunt the disease is a critical strategy. Interestingly, no other COVID-19 risk factors (e.g. obesity, diabetes) impacted cardiovascular risk. These dramatic findings led other groups to study the molecular basis of the results; they found that the virus infects coronary blood vessels and associated immune cells, causing inflammation and cardiovascular failure8.

One of the major limitations of VA databases is that they reflect only a subset of Americans, military personnel, who are most often male and more than 50% of whom identify as white. Without representative national sampling, we may have missed critical insights about the effect of COVID-19 on heart health in women and minorities. Therefore, another research group studied data from the US Collaborative Network in TriNetX, which included 1.2 million appropriate patients, and successfully verified the increase in cardiovascular illness up to a year after COVID-19 infection9. In a separate study using the VA database, the authors of the original VA cardiovascular analysis reported increased neurological disorders (e.g. memory, migraine, and mental health disorders) following COVID-19 infection10. This study accounted for sex, race, age, and other variables. Interestingly, though TriNetX is a global consortium, 80% of patients were American, making the data very informative for US-based COVID-19 strategies, but perhaps less so for European or Asian countries.

Conclusive evidence on the long-term effects of COVID-19 is now emerging from healthcare data worldwide. It’s important to consider that demographic-specific responses to the pandemic varied significantly, as did risk factors. Developing highly accurate, detailed, and diverse datasets is a critical step toward managing the national response to the virus in terms of vaccination strategies and overall pandemic protection.

Ultra-processed foods and early onset colorectal cancer: a new regulatory paradigm?

A large and growing source of healthcare spending is chronic illness, including cancer. More and more new therapies are available and a cancer diagnosis is not necessarily a death sentence anymore, but wouldn't it be less costly and painful to prevent cancer at the outset? Cancer prevention is a critical area of study, and recent innovation has shown that RWD can help identify the causes of certain cancers and inspire preventative measures. The long-term financial, psychological, cardiovascular, reproductive and other consequences of cancer survivorship are such a massive patient burden that survivorship is now an expanding field of study in its own right. The increase in cancer diagnoses, particularly of colorectal cancer (CRC), in younger Americans (<50 years old)11 means these burdens are being taken on by working families, which impacts economic output, home ownership, and all aspects of family dynamics.

RWD has been the source of nutrition and health hypotheses for decades. In the 1960s, geographic analyses demonstrated that CRC rates varied with geography and migration. Studies showed that CRC rates were related to where people had lived long term, not their country of origin, suggesting a role for diet in tumor formation12. These data led to some of the first studies into the role of nutrition in CRC incidence. RWD continues to be used to understand how our diets can modify our risk of developing CRC. For example, a 2022 systematic review of eight large studies published between 2013 and 2022 concluded that CRC incidence was associated with the availability of specific foods such as red meat, meat, animal fats, and energy from animal sources, and with an unhealthy food environment13.

Ultimately, the source of CRC in younger patients is complex and not entirely understood, but recent insights from laboratory and clinical data suggest that the presence of nitrite preservatives in ultra-processed foods (UPFs), particularly meat products, has devastating effects on the colon14-16. UPFs are the result of chemical processes in the commercial food industry and contain ingredients such as hydrogenated oils, modified starches, flavor additives, food coloring, and artificial sweeteners17. These additives are inexpensive, appealing to taste, and present in cheap, mass-produced foods commonly consumed as part of the Western diet. Nitrites in particular are present in many UPFs like hot dogs, deli meats, and prepackaged foods18.

RWD sources (e.g. US Department of Housing and Urban Development zip code data, the US food environment index (FEI), the restaurant environment index (REI), the retail food environment index (RFEI), USDA food desert data sets, and food availability data from the Food and Agriculture Organization of the United Nations (FAO)) provide critical information about food and diet-related factors in specific geographic locations. Going forward, these data could be linked via modern computational tools to cancer statistics obtained from local registries, the NCI, and electronic health records. Information on screening patterns from newer fecal and stool DNA tests like Cologuard will be present in medical and insurance records and could also be linked. Together, these data would paint an important picture of the American diet as it pertains to increased risk of CRC in younger people. Moreover, this information should inform policy making. Many food additives and products popular in the US are already illegal in other countries due to their negative effect on human health. How should federal regulators and lawmakers approach these issues? Armed with high-quality RWD and laboratory data, they can make the case for keeping damaging chemicals and food products from reaching the public, and so reduce the incidence of CRC in younger people.
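To make the linkage idea concrete, here is a minimal sketch in Python of joining an area-level food environment score to registry counts by ZIP code and computing a crude early-onset CRC rate. Everything in it is an assumption for illustration: the function name `link_by_zip`, the field names, and all of the values are made up, and real food environment indices and registry extracts have their own formats. A crude rate like this is only a starting point for an ecological comparison; it does not adjust for age, screening, or any other confounder.

```python
def link_by_zip(food_env: dict[str, float], registry: list[dict]) -> list[dict]:
    """Attach an area-level food environment score to each registry row
    and compute a crude early-onset CRC rate per 100,000 people under 50."""
    linked = []
    for row in registry:
        score = food_env.get(row["zip"])
        if score is None:
            continue  # no food environment data for this ZIP code, so skip it
        rate = 100_000 * row["early_onset_crc_cases"] / row["population_under_50"]
        linked.append(
            {
                "zip": row["zip"],
                "food_env_index": score,
                "rate_per_100k": round(rate, 1),
            }
        )
    return linked


# Hypothetical inputs: ZIP codes, index scores, and counts are all invented.
food_env = {"00001": 6.5, "00002": 8.9}
registry = [
    {"zip": "00001", "early_onset_crc_cases": 12, "population_under_50": 40_000},
    {"zip": "00002", "early_onset_crc_cases": 5, "population_under_50": 50_000},
    {"zip": "00003", "early_onset_crc_cases": 7, "population_under_50": 35_000},
]

for linked_row in link_by_zip(food_env, registry):
    print(linked_row)
```

The third ZIP code is dropped because it has no food environment score, which mirrors a common practical issue: area-level sources rarely cover exactly the same geographies.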

How do we expand the utility and benefit of RWD to increase discovery and breakthroughs?

Advances in technology now permit linkage of multiple types of data, using privacy-protecting technology. Based on the above list of recent discoveries, the potential benefit of incorporating RWD into biomedical hypothesis generation and validation is almost limitless. There are of course obstacles to overcome in the context of RWD. Different data types have varying quality, availability, and digital presence. Currently, medical/prescription claims data are the most mature, aggregated and linkable types of RWD. However, laboratory data, EHR data, mortality data, and other data sources are also nearing a critical level of “data completeness” and accessibility.

Patient privacy and the risk of re-identification of individual patients by “bad actors” is an important concern when using RWD. Strict technical and governance rules for protecting patient records and privacy are essential. In the US, data can be connected and aggregated in a de-identified (privacy-preserving) way. Using Privacy Preserving Record Linkage (PPRL) (also known as “tokenization”), each patient’s relevant RWD can be linked without exposing the patient’s personally identifiable information (PII). This technology has improved significantly over the last decade, but perhaps more importantly the data has reached the necessary scale to offer substantial benefit.
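The general idea behind tokenization can be sketched in a few lines of Python. This is an illustrative toy, not a description of any particular vendor's or consortium's method: it assumes a keyed hash (HMAC-SHA256) over a few normalized identifiers, and the key, the function names (`normalize`, `make_token`, `link_records`), and the sample records are all hypothetical. Production PPRL systems add much more, including key management, multiple token types, handling of typos and missing fields, and formal de-identification review.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; real systems manage keys securely.
SECRET_KEY = b"example-key-not-for-production"


def normalize(value: str) -> str:
    """Lowercase and keep only letters and digits, so that trivial formatting
    differences (case, spaces, punctuation) do not change the token."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def make_token(first_name: str, last_name: str, birth_date: str, sex: str) -> str:
    """Return a keyed hash of the normalized identifiers. Only this token,
    not the identifiers themselves, travels with the de-identified record."""
    message = "|".join(normalize(v) for v in (first_name, last_name, birth_date, sex))
    return hmac.new(SECRET_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(claims: list[dict], labs: list[dict]) -> list[dict]:
    """Join two de-identified datasets on matching tokens."""
    labs_by_token = {row["token"]: row for row in labs}
    linked = []
    for row in claims:
        match = labs_by_token.get(row["token"])
        if match is not None:
            linked.append({**row, **match})
    return linked


# Two data holders tokenize the same (made-up) patient independently.
token_a = make_token("Jane", "Doe", "1980-02-03", "F")
token_b = make_token(" jane ", "DOE", "1980-02-03", "f")

claims = [{"token": token_a, "diagnosis": "G35"}]  # made-up claims row
labs = [{"token": token_b, "ebv_serology": "positive"}]  # made-up lab row

print(token_a == token_b)
print(link_records(claims, labs))
```

Because both sites apply the same normalization and the same key, the two tokens match and the rows can be joined without either dataset carrying a name or birth date. Using a keyed hash rather than a plain hash is a deliberate choice in this sketch: without the key, an attacker cannot simply hash a list of known names and birth dates to reverse the tokens.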

New developments in RWD and PPRL technology have the potential to unlock new discoveries that dramatically impact patient care. As a result, collaborations across various institutions have led to the growth of a number of impactful clinical research cooperatives. Of note, PCORnet (Patient Centered Outcomes Research network) and N3C (the National COVID Cohort Collaborative) are two examples of groups using privacy-protected data and robust governance models to exchange and connect healthcare data for research purposes. Ideally, federal and institutional support should focus on creating more of these research collaboratives to expand the impact of RWD on discovery and patient care.

Dr. Karin Eisinger is an Assistant Professor in the Sarcoma Program at the University of Pennsylvania School of Medicine. Dr. Kathleen Gavin is the Clinical Trials Scientific Lead at Datavant. Dr. Vera Mucaj is Chief Scientific Officer at Datavant.

Contact us to learn about what you can discover with RWD.


  • 1 Concato, J. & Corrigan-Curay, J. Real-World Evidence - Where Are We Now? N Engl J Med 386, 1680-1682, doi:10.1056/NEJMp2200089 (2022).
  • 2 Bjornevik, K. et al. Longitudinal analysis reveals high prevalence of Epstein-Barr virus associated with multiple sclerosis. Science 375, 296-301, doi:10.1126/science.abj8222 (2022).
  • 3 Lanz, T. V. et al. Clonally expanded B cells in multiple sclerosis bind EBV EBNA1 and GlialCAM. Nature 603, 321-327, doi:10.1038/s41586-022-04432-7 (2022).
  • 4 Banerjee, G. et al. Iatrogenic Alzheimer's disease in recipients of cadaveric pituitary-derived growth hormone. Nat Med 30, 394-402, doi:10.1038/s41591-023-02729-2 (2024).
  • 5 Purro, S. A. et al. Transmission of amyloid-beta protein pathology from cadaveric pituitary growth hormone. Nature 564, 415-419, doi:10.1038/s41586-018-0790-y (2018).
  • 6 Jaunmuktane, Z. et al. Evidence for human transmission of amyloid-beta pathology and cerebral amyloid angiopathy. Nature 525, 247-250, doi:10.1038/nature15369 (2015).
  • 7 Xie, Y., Xu, E., Bowe, B. & Al-Aly, Z. Long-term cardiovascular outcomes of COVID-19. Nat Med 28, 583-590, doi:10.1038/s41591-022-01689-3 (2022).
  • 8 Eberhardt, N. et al. SARS-CoV-2 infection triggers pro-atherogenic inflammatory responses in human coronary vessels. Nat Cardiovasc Res 2, 899-916, doi:10.1038/s44161-023-00336-5 (2023).
  • 9 Wang, W., Wang, C. Y., Wang, S. I. & Wei, J. C. Long-term cardiovascular outcomes in COVID-19 survivors among non-vaccinated population: A retrospective cohort study from the TriNetX US collaborative networks. EClinicalMedicine 53, 101619, doi:10.1016/j.eclinm.2022.101619 (2022).
  • 10 Xu, E., Xie, Y. & Al-Aly, Z. Long-term neurologic outcomes of COVID-19. Nat Med 28, 2406-2415, doi:10.1038/s41591-022-02001-z (2022).
  • 11 Akimoto, N. et al. Rising incidence of early-onset colorectal cancer - a call to action. Nat Rev Clin Oncol 18, 230-243, doi:10.1038/s41571-020-00445-1 (2021).
  • 12 Wynder, E. L. & Shigematsu, T. Environmental factors of cancer of the colon and rectum. Cancer 20, 1520-1561, doi:10.1002/1097-0142(196709)20:9<1520::aid-cncr2820200920>;2-3 (1967).
  • 13 Masdor, N. A. et al. The Link between Food Environment and Colorectal Cancer: A Systematic Review. Nutrients 14, 3954 (2022).
  • 14 Cordova, R. et al. Consumption of ultra-processed foods and risk of multimorbidity of cancer and cardiometabolic diseases: a multinational cohort study. Lancet Reg Health Eur 35, 100771, doi:10.1016/j.lanepe.2023.100771 (2023).
  • 15 Crowe, W. et al. Dietary inclusion of nitrite-containing frankfurter exacerbates colorectal cancer pathology and alters metabolism in APC(min) mice. NPJ Sci Food 6, 60, doi:10.1038/s41538-022-00174-y (2022).
  • 16 Bouvard, V. et al. Carcinogenicity of consumption of red and processed meat. Lancet Oncol 16, 1599-1600, doi:10.1016/S1470-2045(15)00444-1 (2015).
  • 17 Morales-Berstein, F. et al. Ultra-processed foods, adiposity and risk of head and neck cancer and oesophageal adenocarcinoma in the European Prospective Investigation into Cancer and Nutrition study: a mediation analysis. Eur J Nutr 63, 377-396, doi:10.1007/s00394-023-03270-1 (2024).
  • 18 Crowe, W., Elliott, C. T. & Green, B. D. A Review of the In Vivo Evidence Investigating the Role of Nitrite Exposure from Processed Meat Consumption in the Development of Colorectal Cancer. Nutrients 11, doi:10.3390/nu11112673 (2019).

