Genome data for integrated care - Enter the video people

HIMSS Insights speaks to Mikel Hernaez, director of computational genomics at the Carl R. Woese Institute for Genomic Biology at the University of Illinois.
By Philipp Grätzel von Grätz
08:40 AM

The size of a human genome dataset depends on many factors, but in any case it is huge. An often-cited estimate puts the original raw dataset at 100 gigabytes. It can be even more, depending on how fully a “full” genome is sequenced. Files of that size are difficult to handle on conventional computers, and even more so on mobile devices. If we adhere to the vision that the smartphone will become the modern citizen’s personal steering wheel for all health- and disease-related tasks, the question arises of how to make genomics (and other -omics) data available on such platforms.

This is just one article in the new HIMSS Insights eBook: Connected Care and Interoperability. Download the full edition here.

Mikel Hernaez, director of computational genomics at the Carl R. Woese Institute for Genomic Biology at the University of Illinois, is among those trying to answer this question. Hernaez is part of a standardisation initiative in which the Moving Picture Experts Group (MPEG) and ISO Technical Committee 276 are joining forces: “Our goal is to produce MPEG-G, a new open standard for genomic information representation. It is expected to aid significantly the storage, transmission, and most importantly, the processing of raw and aligned sequencing data.”

MPEG-G use cases include privacy rules and selective encryption

Together with colleagues, Hernaez has recently authored the first comprehensive publication on the MPEG-G work on the bioRxiv preprint server. He sees a considerable need for a new genomic compression standard: “Genomic data is still largely stored using compression algorithms that perform poorly in practice. I believe that the fact that most existing compressors are not guaranteed to be long-term maintained poses the largest barrier for adoption of genomic data compression technologies.” With MPEG-G, this will be different, according to Hernaez: “Given its ISO standard designation, it is guaranteed that a file compressed according to the MPEG-G specifications will be accessible for life.”

In addition to its long-term guarantees, another strength of MPEG-G is that it addresses a wide array of use cases. These include selective access to compressed data according to several criteria, aggregation of genomic studies, enforcement of privacy rules, selective encryption of sequencing data and metadata, annotation and linkage of genomic segments, and incremental updates of sequencing data and metadata. “Some of these use cases are not or only partially covered by other technologies, for example enforcement of privacy rules and selective encryption. This is a major breakthrough. The current genomic data files do not include any type of implicit security mechanisms to ensure that the person or system accessing the file has been granted permission by the owner of the data.”

Convincing people in biology remains a challenge

Developing a new standard from scratch is not easy, of course: “A main challenge has been and continues to be the interdisciplinary nature of the project,” Hernaez points out. “The expertise of the people involved in this project ranges from biology, to engineering, computer science and genetics. As you can imagine, working in such an interdisciplinary environment is quite challenging.”

Beyond managing the interdisciplinary collaboration, there is another challenge: MPEG is a standardisation community that has its roots in video compression. “Developing MPEG-G is also about convincing people in the biology space that the ‘video people’ are capable of producing a great new format for compressed representation of genomic information,” says Hernaez. In fact, not everyone involved in MPEG-G is one of the ‘video people’, with Hernaez himself, an expert in computational genomics, being the prime example. “But it is still a challenge. We hope that with the release of the first implementations of the standard we will convince people to, at least, try it.”

How much can a genome be shrunk?

The “video background” is in fact helpful to a certain extent, says Hernaez: “Some aspects of MPEG-G are vastly different from video-coding, for example the modelling part. But on the other hand, the current entropy codec used in MPEG-G has been partially borrowed from video-compression.” One criticism raised after the first MPEG-G publication in autumn 2018 was that it contained no comparisons with existing standards in terms of compression performance.

For Hernaez, this is a misunderstanding of the intentions and the scope of the MPEG-G work: “The true advantages of MPEG-G do not lie in the compression capabilities per se, but in the ‘wrapper’ developed around the core compression technology. This wrapper really promotes competition and interoperation across different solutions. Having said this, we can have a tentative measure of the compression capabilities that MPEG-G could achieve by looking at the technologies specified in the specifications. For example, the SPRING technology, which is included in the MPEG-G standard, can achieve a 25-fold reduction of raw sequencing data size on data generated by the newest Illumina sequencing machine.”

This would mean that the 100-gigabyte genome mentioned at the beginning could be reduced to about four gigabytes. An even bigger reduction in size would be possible if lossy compression were permitted, a mode in which (again in analogy to video compression) some precision is lost without affecting downstream analyses. Hernaez points out that since MPEG is part of ISO, the standard development process is open to anybody interested in contributing. To make this easier, some of the academic partners involved in MPEG-G have recently created a group called MITOGEN (Mutual Initiative to Tackle Optimal Genomic ENcoding) to share openly all software developed by the groups involved. An open source encoder/decoder based on the MPEG-G specifications, called ‘Genie’, is also under development.
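
For readers who want to check the four-gigabyte figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration under the article's own numbers (a 100-gigabyte raw dataset and the 25-fold reduction quoted for SPRING); the function and constant names are hypothetical and are not part of MPEG-G, SPRING or Genie.

```python
def compressed_size_gb(raw_size_gb: float, reduction_factor: float) -> float:
    """Return the dataset size in gigabytes after an N-fold size reduction."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return raw_size_gb / reduction_factor


# Figures quoted in the article (estimates, not measurements made here).
RAW_GENOME_GB = 100.0   # often-cited size of a raw human genome dataset
SPRING_FOLD = 25.0      # 25-fold reduction quoted for SPRING

size = compressed_size_gb(RAW_GENOME_GB, SPRING_FOLD)
print(f"{size:.1f} GB")  # prints "4.0 GB"
```

The same function can be applied to other raw sizes or reduction factors, since both figures vary with the sequencing depth and the technology used.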


Healthcare IT News is a HIMSS Media publication.