New framework helps streamline EHR data preprocessing
Researchers from the University of Michigan have developed an open-source framework that streamlines the preprocessing of data extracted from electronic health records (EHRs).
The framework, which the researchers call FIDDLE (Flexible Data-Driven Pipeline), can greatly speed up EHR data preprocessing and assist machine learning (ML) practitioners working with health data, according to a study published this week in the Journal of the American Medical Informatics Association.
"By accelerating and standardizing the labor-intensive preprocessing steps, FIDDLE can help stimulate progress in building clinically useful ML tools," wrote the researchers.
WHY IT MATTERS
EHR data preprocessing can vary widely among studies, which makes it difficult to compare different algorithms and ensure that machine learning results can be reproduced. And although some researchers have proposed pipelines, those techniques aren't always generalizable.
"EHR data are messy, often consisting of high-dimensional, irregularly sampled time series with multiple data types and missing values," wrote the researchers.
"Transforming EHR data into feature vectors suitable for ML techniques requires many decisions, such as what input variables to include, how to resample longitudinal data, and how to handle missing data, among many others," they continued.
To that end, University of Michigan researchers developed FIDDLE, which transforms structured EHR data into useful representations for ML algorithms.
"FIDDLE was designed to work out of the box with reasonable default settings, but it also allows users to customize certain arguments and incorporate task-specific domain knowledge," wrote the research team.
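The kinds of decisions the researchers describe (which variables to include, how to resample irregular time series, how to handle missing values) can be illustrated with a minimal pandas sketch. This is not FIDDLE's code or API. The column names, the `to_features` helper, the four-hour bins, and the choice of mean aggregation, forward-filling and mean imputation are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format EHR extract: one row per measurement.
records = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2],
    "hours": [0.5, 1.2, 5.0, 5.5, 0.1, 7.9],  # hours since admission
    "variable": ["heart_rate", "heart_rate", "heart_rate",
                 "lactate", "heart_rate", "heart_rate"],
    "value": [80.0, 90.0, 100.0, 2.0, 70.0, 75.0],
})


def to_features(records, bin_hours=4, n_bins=2):
    """Resample irregular measurements into fixed time bins (illustrative only)."""
    df = records.copy()

    # Assign each measurement to a fixed-width time bin and drop
    # anything recorded after the observation window.
    df["bin"] = (df["hours"] // bin_hours).astype(int)
    df = df[df["bin"] < n_bins]

    # One row per (patient, bin), one column per variable; repeated
    # measurements within a bin are averaged.
    wide = df.pivot_table(index=["patient_id", "bin"], columns="variable",
                          values="value", aggfunc="mean")

    # Make sure every patient has a row for every bin, even if empty.
    full_index = pd.MultiIndex.from_product(
        [sorted(records["patient_id"].unique()), range(n_bins)],
        names=["patient_id", "bin"],
    )
    wide = wide.reindex(full_index)

    # Record which values were missing before imputation.
    mask = wide.isna().astype(int).add_suffix("_missing")

    # Carry the last observation forward within each patient, then fall
    # back to the column mean for values that are still missing.
    filled = wide.groupby(level="patient_id").ffill()
    filled = filled.fillna(wide.mean())

    return pd.concat([filled, mask], axis=1)


features = to_features(records)
print(features)
```

Each row of `features` is one patient in one time bin, with a value column and a missingness indicator per variable. Changing any of the assumed choices, such as the bin width or the imputation rule, would produce a different feature matrix from the same raw data, which is the reproducibility problem the article describes.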
In their evaluation of FIDDLE, researchers trained models to predict in-hospital mortality, acute respiratory failure and shock.
"In our proof-of-concept experiments, features generated by FIDDLE led to good predictive performance across different outcomes, prediction times, and classification algorithms," wrote the team.
The researchers noted that obtaining a usable model requires many more steps beyond preprocessing and that FIDDLE only considers the structured content in the EHR.
Still, "though FIDDLE is not a one-size-fits-all solution to preprocessing and further work is needed to test the limits of its generalizability, it can help accelerate ML research applied to EHR data," they said.
THE LARGER TREND
University of Michigan researchers pointed out in their study that FIDDLE can be used in conjunction with other tools to incorporate unstructured EHR content, which studies suggest can have substantial predictive value in clinical research.
Last year, Mount Sinai Health System demonstrated how clinicians and case managers can use natural language processing algorithms to gain insight, particularly concerning social determinants of health, from unstructured content.
"There are a lot of innovations where we've seen natural language processing coming up in a big way," said Varun Gupta, IT director, advanced analytics and data management, at Mount Sinai.
ON THE RECORD
"While FIDDLE is by no means the single best way to preprocess data for all use cases, it facilitates reproducibility and the sharing of preprocessing code (oftentimes overlooked or not fully described in the literature)," wrote the research team.
"We hope that FIDDLE will be useful to other researchers; ultimately, once the community starts using the tool, we will be able to collectively refine and build on it together," they added.