Researchers pilot 'model to data' approach to developing predictive algorithms

A new JAMIA study suggests it's possible to protect sensitive patient data while developing models to ultimately optimize care.
By Kat Jercich

Patient privacy concerns can make it challenging to develop predictive models using data from electronic health records. But a new study published this week in the Journal of the American Medical Informatics Association suggests that researchers can use the "model to data" approach to develop models without having direct access to patient data.

"The focus of the MTD framework is to deliver containerized algorithms to private data, of any standardized form, without exposing the data," wrote the University of Washington researchers in the study.

"With increased computational resources, our platform could scale up to handle submissions of multiple prediction models from multiple researchers," they added.

EHR systems offer a bounty of patient data for potential use in predictive models, which in turn can allow providers to allocate resources and staffing and streamline care. 

However, healthcare institutions must also protect patient privacy when working with EHR data. Data de-identification and synthetic data creation are two ways to do so, the University of Washington researchers wrote, but neither method is without its disadvantages.

"De-identification reduces the risk of information leakage but may still leave a unique fingerprint of information that is susceptible to reidentification," wrote the study authors. Although "de-identified datasets like MIMIC-III are available for research and have led to innovative research studies," they noted that such datasets are limited in size, scope or availability.

No method of synthetic data creation, the researchers explained, "can generate an entire synthetic repository while preserving complete longitudinal and correlational aspects of all features from the original clinical repository."

In a pilot study, University of Washington researchers investigated the viability of a third solution: the MTD framework, in which model developers send models to an isolated environment for training and evaluation on sensitive data.

"We selected all patients who had at least one visit in the UW OMOP repository, which represented 1.3 million patients, 22 million visits, 33 million procedures, 5 million drug exposure records, 48 million condition records, 10 million observations and 221 million measurements," the researchers wrote.

For this MTD study, the UW team asked the model developers to create a model predicting the likelihood of patient mortality within 180 days of the patient's last visit. 
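As a rough illustration of that prediction target, the label for each patient can be derived from two dates. The sketch below is not taken from the study: the function name, the inclusive 180-day boundary and the handling of missing death dates are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Optional


def died_within_window(
    last_visit: date,
    death_date: Optional[date],
    window_days: int = 180,
) -> int:
    """Return 1 if a death was recorded within window_days of the last visit, else 0.

    Assumptions (not from the study): the boundary day is inclusive, and a
    missing death date is treated as "no death recorded".
    """
    if death_date is None:
        return 0
    cutoff = last_visit + timedelta(days=window_days)
    return int(last_visit <= death_date <= cutoff)


# Example with made-up dates: a death 180 days after the last visit counts,
# one 181 days after does not.
print(died_within_window(date(2019, 1, 1), date(2019, 6, 30)))
print(died_within_window(date(2019, 1, 1), date(2019, 7, 1)))
```

How the boundary and missing records are treated changes the label counts, so a real pipeline would need to take those rules from the study protocol rather than from this sketch.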

"This model was first tested on a synthetic dataset ... by the model developer to ensure that the model did not fail when accessing data, training, and making predictions," the researchers explained. After the model was submitted to the UW computing environment, they continued, "the [Common Workflow Language] pipeline verified, built and ran the image through 2 stages, the training and inference stages."
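The two-stage flow described above can be sketched in a few lines. This is a toy illustration only, not the UW pipeline: the real system ran containerized images through a Common Workflow Language pipeline, whereas the class, function and field names here (AgeBandModel, run_submission, "age") are hypothetical, and returning a Brier score as the only output is an assumption chosen to show that a summary, not the data, leaves the host.

```python
from typing import Dict, List, Sequence

Row = Dict[str, float]  # hypothetical flat feature record, e.g. {"age": 67.0}


class AgeBandModel:
    """Toy submission: predicts the training-set death rate of each 10-year age band."""

    def __init__(self) -> None:
        self.rates: Dict[int, float] = {}
        self.default = 0.0

    def train(self, rows: Sequence[Row], labels: Sequence[int]) -> None:
        by_band: Dict[int, List[int]] = {}
        for row, label in zip(rows, labels):
            by_band.setdefault(int(row["age"] // 10), []).append(label)
        self.rates = {band: sum(v) / len(v) for band, v in by_band.items()}
        self.default = sum(labels) / len(labels)  # fallback for unseen bands

    def infer(self, rows: Sequence[Row]) -> List[float]:
        return [self.rates.get(int(r["age"] // 10), self.default) for r in rows]


def run_submission(
    model: AgeBandModel,
    train_rows: Sequence[Row],
    train_labels: Sequence[int],
    test_rows: Sequence[Row],
    test_labels: Sequence[int],
) -> Dict[str, float]:
    """Host-side runner: both stages execute where the data lives."""
    model.train(train_rows, train_labels)   # stage 1: training
    preds = model.infer(test_rows)          # stage 2: inference
    brier = sum((p - y) ** 2 for p, y in zip(preds, test_labels)) / len(preds)
    # Only aggregate results are returned; rows and per-patient predictions stay inside.
    return {"n_scored": len(preds), "brier": brier}


# Made-up records standing in for the private data held by the host.
train = [{"age": 85.0}, {"age": 82.0}, {"age": 30.0}, {"age": 35.0}]
test = [{"age": 88.0}, {"age": 33.0}]
print(run_submission(AgeBandModel(), train, [1, 1, 0, 0], test, [1, 0]))
```

The point of the design is the direction of travel: the developer's code goes to the data holder, and only the summary dictionary comes back.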

Developers were able to create three different models: one using demographic information alone; one using demographic information and five common chronic diseases; and one using demographic information and the 1,000 most common features from the EHR's condition, procedure and drug domains. The implementation of the first had an AUROC of 0.693; the second had an AUROC of 0.861; and the third had an AUROC of 0.92.
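For readers unfamiliar with the metric, AUROC (area under the receiver operating characteristic curve) can be read as the probability that a randomly chosen patient who died is scored higher than a randomly chosen patient who did not, so 0.5 is chance and 1.0 is perfect ranking. The minimal sketch below computes it by direct pairwise comparison; it is an illustration with made-up numbers, not the evaluation code used in the study.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: three of the four positive/negative pairs are
# ordered correctly, so this prints 0.75.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of patients, which is fine for illustration; at the scale described in the study, a rank-based implementation would be the practical choice.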


Despite its potential drawbacks, researchers and developers have pointed to synthetic data as a way to address the problems of using real-world health information. 

"Synthetic data is likely not a 100% accurate depiction of real-world outcomes like cost and clinical quality, but rather a useful approximation of these variables," explained Robert Lieberthal, principal for health economics at the MITRE Corporation, in a HIMSS20 Digital presentation earlier this year.

"In addition, synthetic data constantly is improving, and methods like validation and calibration will continue to make these data sources more realistic," he said.

And such uses are increasingly timely: Earlier this summer, the Veterans Health Administration announced a challenge to predict COVID-19 outcomes among veterans using synthetic health data.


"The prevalence of EHR systems in hospitals enables the accumulation and utilization of large clinical data to address specific clinical questions. Given the size and complexity of these data, machine learning approaches provide insights in a more automated and scalable manner," wrote the UW researchers. 

"Healthcare providers have already begun to implement predictive analytics solutions to optimize patient care, including models for 30-day readmissions, mortality, and sepsis," they added. "As hospitals improve data capture quality and quantity, opportunities for more granular and impactful prediction questions will become more prevalent."

Kat Jercich is senior editor of Healthcare IT News.
Healthcare IT News is a HIMSS Media publication.