Huge dataset of biological images made available to spur new AI algorithms

Recursion releases more than 300 gigabytes of data it hopes will be a "playground" for innovative new machine learning applications.
By Nathan Eddy
11:08 AM

Clinical-stage biotechnology firm Recursion announced the release of an open-source biological dataset, RxRx1, which the company has been building for more than five years.

WHY IT MATTERS
The dataset is composed of images of human cells from more than 1,000 experimental conditions with dozens of biological replicates produced weeks and months apart in a variety of human cell types.

The collection of data represents a potentially vast resource for the machine learning community, with more than 100,000 images and 300-plus gigabytes of data representing diverse biological contexts.

"To answer fundamental questions facing biology and disease, and reimagine the drug discovery paradigm, we're building the world's largest, relatable, empirical biological dataset," said Recursion CEO Chris Gibson in a statement.

The data was generated at multiple Recursion sites under highly controlled experimental procedures. Because each batch of experimental data still contains its own experimental variations, the dataset could also serve as an arena for scientists working in areas of machine learning research such as domain adaptation and k-shot learning.
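
For readers curious what batch-to-batch variation means in practice, one common way to test whether a model generalizes across it is to hold out entire experimental batches for evaluation. The sketch below is illustrative only: the record layout, the image and batch names, and the `split_by_batch` helper are hypothetical and are not taken from Recursion's release.

```python
def split_by_batch(records, held_out_batches):
    """Split (image_id, batch_id) records so held-out batches never appear in training.

    The (image_id, batch_id) layout is an assumption made for illustration.
    """
    train_ids, test_ids = [], []
    for image_id, batch_id in records:
        if batch_id in held_out_batches:
            test_ids.append(image_id)
        else:
            train_ids.append(image_id)
    return train_ids, test_ids


# Hypothetical records: each image is tagged with the batch that produced it.
records = [
    ("img_001", "batch_A"),
    ("img_002", "batch_A"),
    ("img_003", "batch_B"),
    ("img_004", "batch_C"),
]

# Hold out batch_C so the model is evaluated on a batch it has never seen.
train_ids, test_ids = split_by_batch(records, {"batch_C"})
print(train_ids, test_ids)
```

Splitting by batch rather than by individual image keeps a model from scoring well simply by memorizing the quirks of batches it saw during training.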

"Despite the massive scale of this dataset, it represents just 0.4 percent of what we generate at Recursion on a weekly basis," Gibson added. "We expect that the richness of this dataset, combined with the context surrounding the scale of our efforts, will inspire the world's machine learning and AI community to help us in our mission to decode biology to radically improve lives."
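
The 0.4 percent figure can be turned into a rough weekly volume. The short calculation below is a back-of-the-envelope sketch, assuming the release is exactly 300 gigabytes (the company says "300-plus") and using decimal units; the function and variable names are illustrative.

```python
def implied_weekly_terabytes(release_gb, share_percent):
    """Estimate weekly output in TB if release_gb is share_percent of one week's data."""
    weekly_gb = release_gb * 100 / share_percent
    return round(weekly_gb / 1000, 1)  # decimal units: 1 TB = 1,000 GB


# Assumption: 300 GB is treated as an exact figure, so the result is a lower bound.
weekly_tb = implied_weekly_terabytes(release_gb=300, share_percent=0.4)
print(f"Implied weekly output: about {weekly_tb} TB")
```

Under those assumptions, Gibson's statement implies that Recursion generates at least about 75 terabytes of data each week.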

Gibson predicted that if the release helps enable collective efforts, new treatments will reach the market faster and more companies will have an incentive to develop drugs for smaller markets, such as rare diseases, where many patients still face a major unmet need.

THE LARGER TREND
Advances in machine learning methods outside of the life sciences have already been accelerated through the availability of large-scale public datasets, such as ImageNet and COCO, among many others.

Like those initiatives, Recursion's dataset is intended as a shared resource that helps the community collectively identify and adopt new machine learning methods that benefit the entire life sciences industry.

The company's own database holds more than two petabytes of biological images, generated in-house on its robotics platform. Recursion uses it to apply machine learning approaches that reveal drug candidates, mechanisms of action, and potential toxicity.

ON THE RECORD
"We are excited to provide the data science community with the first longitudinally-generated, human cell biology image dataset to facilitate new machine learning applications," Recursion's chief technology officer and chief product officer Mason Victors said in a statement.

By combining experimental biology and automation with AI in a massively parallel system, Recursion hopes to improve the efficiency of discovering potential drugs for diverse indications, including genetic disease, inflammation, immunology, and infectious disease.

"This dataset provides a great playground for those working in multiple areas of machine learning research, such as domain adaptation and k-shot learning," said Berton Earnshaw, Vice President of Data Science, Recursion. "Developing methods to account for the non-random experimental noise is something that should be of interest to those beyond just the life science community."

Nathan Eddy is a healthcare and technology freelancer based in Berlin.

Healthcare IT News is a HIMSS Media publication.