Data is probably the most powerful tool in research medicine, promising to leapfrog over long-standing barriers and deliver dramatic advances for patients and communities. Yet, the issue of data privacy adds enormous challenges and complexity to achieving the potential that big data analytics offers, in healthcare and beyond.
The growing focus on and public concern about data security and privacy have compounded the existing problems of data sharing in healthcare. The fragmented structure of the U.S. healthcare system inhibits easy access to medical records for both providers and patients, and the lack of interoperability across electronic health record systems limits the ability to share critical patient data in time to support care coordination, research, or tracking with real-time data.
Regulatory science and ethical-legal standards have failed to keep up with technology, often presenting barriers to the rapid access and effective use of data, especially at scale. HIPAA, the prevailing law governing personal health data, was enacted in 1996, in the early days of the internet age, and remains largely unable to address the already prevalent use of large, de-identified data sets for artificial intelligence and other large-scale analytics. Furthermore, direct patient consent cannot reasonably be obtained from the hundreds of millions of patients whose historical records are siloed throughout the healthcare ecosystem.
Fortunately, the massive acceleration of data science in the past decade has enabled a technological breakthrough, synthetic data, that promises to solve many of these challenges not by regulation but by data science itself. Generated by advanced machine learning models, synthetic data is realistic but not real: it is designed to accurately and reliably mimic the statistical properties of the original data without exposing any actual patient information, protecting patient privacy and potentially bypassing the need for further regulatory requirements.
Using synthetic data, healthcare organizations can create, share and access secure patient-level data, decreasing the burden of internal access and analysis and opening the door to more external collaboration. For life science companies, this could mean, among other things, reusing existing clinical trial data outside the original scope of informed consent, gaining portable access to health system data, and increasing cohort sizes for rare disease research.
While these synthetic data sets are undergoing additional research to confirm their accuracy and reliability, they are also actively being used and tested in real-world settings.
The NIH's National COVID Cohort Collaborative (N3C) is leveraging this technology to generate a non-identifiable synthetic version of the largest available repository of patient-level COVID-19 data, enabling greater access to the massive amounts of clinical data needed to advance research efforts. This unprecedented, collaborative access to meaningful data has implications far beyond the pandemic, with potential applications for synthetic data across therapeutic areas and in future public health emergencies.
Throughout my career as a physician and researcher in the fields of medical ethics and quality of care, I have witnessed firsthand the complex struggle between two moral imperatives: the need to protect patient privacy and the ethical mandate to share medical data so that life-saving treatments can be developed, tested and approved for use in human patients. The ongoing COVID-19 pandemic has starkly underscored the need for large data sets that are readily available, the tragic loss of life when data is not available, and the power of what is possible when it is.
It is rare that technology is the solution to an ethical problem, but synthetic data, if done right, holds great promise to revolutionize digital science in healthcare: accelerating progress and innovation, strengthening public health, bringing more personalized treatment options to patients, and ultimately improving patient outcomes.