Commentary: Exploring the potential of graph analytics in healthcare

By David Anstey

Speed matters in medical care. It changes the balance between life and death — in the obvious places like emergency rooms but also in finding proper treatment fast enough for patients with chronic diseases such as cancer, heart disease, renal failure, Type 1 diabetes, or even arthritis.

Provisions in the Affordable Care Act that base payments in large part on successful outcomes also make right-the-first-time treatments more important. The challenge is in finding those treatments (and gaps in patient follow-through) to ensure the greatest benefit with the least cost, risk, and re-admission to hospitals.

Scattered data
All of the information a diagnostician might need likely exists, but it is not all in one place, not all associated with the same disease, and not all findable with standard query techniques on standard database technology. The structures are too diverse (including unstructured information) and the query parameters too limiting. And given the size of medical datasets, which Graham Hughes, the CMO of the SAS Center for Health Analytics and Insights, said reached 150 exabytes in 2011, the search tasks are gargantuan.

There is another approach to Big Data analysis: graph analytics. With graph analytics and the hardware it runs on, there is no need to normalize data or create a homogeneous schema prior to integration, a task that can consume weeks or months of skilled database administrators’ time before querying even begins.

Data agnostic
Graph analytics allows any structured or unstructured data to be examined “as is,” without first making everything uniform, and readily available tools can quickly convert relational data into graph-compatible Resource Description Framework (RDF) triples, each a subject, predicate, and object. In that format, hundreds of terabytes of data can be loaded into memory and processed in parallel. Where SQL queries can slow down or stop entirely as they grow more complex, SPARQL, a query language built around graph pattern matching, can return results in seconds and can detect similarities or patterns that are very broadly defined.
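
To make the idea of triples concrete, here is a minimal sketch in plain Python. It is not an RDF library or a SPARQL engine; the table names, column names, and sample rows are hypothetical, and the `match` helper only imitates the wildcard-style pattern matching that a SPARQL query expresses.

```python
# Illustrative sketch only: tables, columns, and rows are hypothetical.
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]

def rows_to_triples(table: str, key: str, rows: Iterable[dict]) -> List[Triple]:
    """Turn each relational row into (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = f"{table}/{row[key]}"
        for column, value in row.items():
            if column != key and value is not None:
                triples.append((subject, column, str(value)))
    return triples

def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Two differently shaped "tables" merged into one graph, with no shared schema.
patients = [
    {"id": 1, "diagnosis": "type 1 diabetes", "age": 34},
    {"id": 2, "diagnosis": "arthritis", "age": 61},
]
prescriptions = [
    {"id": 10, "patient": "patient/1", "drug": "insulin"},
]
graph = (rows_to_triples("patient", "id", patients)
         + rows_to_triples("rx", "id", prescriptions))

# Which drugs were prescribed to patients diagnosed with type 1 diabetes?
diabetic = {s for s, _, _ in match(graph, p="diagnosis", o="type 1 diabetes")}
drugs = sorted(
    drug
    for rx, _, patient in match(graph, p="patient") if patient in diabetic
    for _, _, drug in match(graph, s=rx, p="drug")
)
print(drugs)
```

The point of the sketch is that both sources land in the same list of triples without being reshaped first, and the question is answered by following links between them rather than by a predefined join.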

In fact, a complex query against a multi-exabyte data set has to reduce the uncertainty in the algorithms on which it is based, and that can only be accomplished by building assertions about how the data interacts. Graphs and enormous data sets are unpredictable and, as such, uncertain. So it’s essential to make the right assertions by learning what is in the data set and discovering how data connects, how inferences enhance the data, and which groups are sparse and which are dense. The only way to accomplish this at any scale while effectively reducing the uncertainty is by loading everything in-memory in unpartitioned form.

The improvement of healthcare will require a much deeper understanding of a huge data set before it’s possible to start asking the right questions. These questions, the highly complex queries, require validation which is, again, an in-memory issue.

Healthcare has data and dimensions that have never been leveraged, and more information is added every second, so data sets that are already enormous keep growing. Insights on improving healthcare and optimizing providers’ effectiveness can be found in the historical data, but any analysis must incorporate the most recent information. Insights, therefore, won’t be gleaned from a batch job that examines a historical data dump. The data set is constantly changing, capturing the most recent developments in medicine and more accurately reflecting the population of patients and their healthcare issues.

Queries must be centered on the relevant issues and concerns, whether they’re prevalent or anomalous, and that requires a detailed understanding of the data through discovery efforts, rapid hypothesis testing, and iterative analysis.

Needless to say, complex queries on large data sets can’t afford to be delayed. Yet the high cost — in terms of the time, labor, and computing resources required to restructure data sets and reindex them for new queries — makes this level of analytical depth untenable on commodity/cloud infrastructure.

For healthcare outcomes to improve, discovery needs to be low cost and highly responsive. So, while diagnosis might ordinarily be siloed (examining only patients with the same condition or symptoms), graph analytics and the hardware it runs on let researchers or physicians use in-memory processing to examine all available patient records to discover what might be relevant. Any large medical center is likely to have millions of patient records in a wide variety of formats and data silos that can’t be affordably and quickly queried using standard database techniques. Those techniques are likely to return useful insights only after months spent preparing the data, running queries, and generating reports.

With graph analytics, medical center physicians can practice “evidence-based” medicine, which draws on the patient’s demographics, physical condition, vital signs, symptomatology, and personal and family medical history to determine which treatments are most likely to succeed, based on the care provided to others who share similar characteristics. The process can be expected to complete in minutes or seconds, supporting accurate diagnosis and treatment recommendations while a patient is still in the examining room.
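
As a rough illustration of matching a patient against others who share similar characteristics, the sketch below ranks treatments by how often they succeeded among sufficiently similar past patients. Everything in it is an assumption made for illustration: the trait labels, the sample records, the use of Jaccard similarity, and the 0.5 threshold. It is not a clinical method and not a description of any particular product.

```python
# Illustrative sketch only: traits, records, and threshold are hypothetical.
from collections import defaultdict

def jaccard(a: set, b: set) -> float:
    """Share of traits two patients have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_treatments(new_patient: set, records: list, min_similarity: float = 0.5):
    """Rank treatments by success rate among sufficiently similar past patients."""
    tally = defaultdict(lambda: [0, 0])  # treatment -> [successes, total]
    for rec in records:
        if jaccard(new_patient, rec["traits"]) >= min_similarity:
            tally[rec["treatment"]][0] += int(rec["success"])
            tally[rec["treatment"]][1] += 1
    # Best success rate first, then larger sample, then name for stable order.
    return sorted(
        ((t, s / n, n) for t, (s, n) in tally.items()),
        key=lambda item: (-item[1], -item[2], item[0]),
    )

records = [
    {"traits": {"age:60-69", "hypertension", "smoker"}, "treatment": "A", "success": True},
    {"traits": {"age:60-69", "hypertension"}, "treatment": "A", "success": True},
    {"traits": {"age:60-69", "hypertension", "smoker"}, "treatment": "B", "success": False},
    {"traits": {"age:20-29", "asthma"}, "treatment": "C", "success": True},
]
new_patient = {"age:60-69", "hypertension", "smoker"}

# Each entry is (treatment, success rate, number of similar patients).
ranking = rank_treatments(new_patient, records)
print(ranking)
```

A real system would weigh far more evidence than a handful of shared traits, but the shape of the question is the same: find the comparable cases first, then ask what worked for them.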

Aggregated knowledge and insight
The impact on effectiveness is significant. If evidence-based results can be tapped to reveal relationships that are outside a practitioner’s area of expertise or knowledge, to uncover the significance of particular symptoms to improve diagnosis, and to identify at-risk situations for particular cohorts under care, the time needed to reach an effective treatment shrinks dramatically and the rate of success goes up.

On a far larger scale, graph analytics’ potential is being explored in the realm of public health, especially as a way to prevent the spread of disease, detect the potential for outbreaks, and determine the approaches with the greatest likelihood of containing an outbreak and treating those affected. In this context, the Centers for Disease Control and Prevention (CDC) is turning to Big Data and electronic health records to improve bio-surveillance and outbreak prevention. Graph analytics can ensure that all available data are considered, processed as a whole, and queried to reveal what’s expected, suspected, and completely unknown. The margin of error shrinks based on what is known, and discovered facts are revealed in seconds, rather than in days or weeks (or never) when relying on guesswork, trial and error, and traditional queries.

The Food and Drug Administration (FDA) is focusing on similar improvements through pre- and post-market analysis of drugs and devices. Compiling every bit of clinical, in-practice, and post-treatment data, investigators will be able to determine strengths, weaknesses, and dangers virtually as they occur. That’s possible because the capacity of graph analytics memory and processing hardware permits dynamic updates on the fly as new information comes in. And as those updates reveal new insights and relationships, queries can be modified to explore additional possibilities (also in real time) to fine-tune insights in a matter of minutes.

Evolutionary technology
Graph analytics is not a replacement for traditional databases or even Hadoop clusters. It’s a complementary approach that can either pick up where traditional approaches stalled or combine heterogeneous sources to save pre-query effort and manpower. A side benefit is that standard hardware is freed up for analysis of “knowns” (the identity of people who were prescribed a recalled drug and need to be given a new one, for instance) or for inquiries that are not time-sensitive.

Bottom line issues
The same approach to finding data relationships that can help cure and treat disease can also be used to spot unusual patterns of activity in referrals for additional care or billings for that care. If a non-ambulatory patient is referred to a specialist more than 100 miles away, that’s an easy red flag to spot, as is a request for payment for treating an individual who can’t be identified as a patient. Harder to spot are billings for local treatment with payment made to distant locations or third-party entities, duplicate submissions to multiple payers, and claims for treating people who have died.
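
The red flags described above can be expressed as simple checks across linked records. The following is a minimal sketch, assuming hypothetical patient, referral, and claim records with invented field names and sample values; it shows only the rule logic, not how a graph engine would run it at scale.

```python
# Illustrative sketch only: records, field names, and values are hypothetical.
from collections import defaultdict
from datetime import date

patients = {
    "p1": {"ambulatory": False, "died": None},
    "p2": {"ambulatory": True, "died": date(2013, 1, 5)},
}
referrals = [{"patient": "p1", "miles": 140}]
claims = [
    {"id": "c1", "patient": "p1", "service": "dialysis", "day": date(2013, 3, 1), "payer": "X"},
    {"id": "c2", "patient": "p1", "service": "dialysis", "day": date(2013, 3, 1), "payer": "Y"},
    {"id": "c3", "patient": "p2", "service": "checkup", "day": date(2013, 2, 1), "payer": "X"},
    {"id": "c4", "patient": "p9", "service": "x-ray", "day": date(2013, 2, 2), "payer": "X"},
]

def flag_suspicious(patients, referrals, claims, max_miles=100):
    """Return (reason, record id) pairs for the red flags named in the text."""
    flags = []
    # Non-ambulatory patient referred to a distant specialist.
    for ref in referrals:
        patient = patients.get(ref["patient"])
        if patient and not patient["ambulatory"] and ref["miles"] > max_miles:
            flags.append(("distant_referral", ref["patient"]))
    payers = defaultdict(set)
    for claim in claims:
        patient = patients.get(claim["patient"])
        if patient is None:
            # Payment requested for someone not identifiable as a patient.
            flags.append(("unknown_patient", claim["id"]))
        elif patient["died"] and claim["day"] > patient["died"]:
            # Treatment billed after the recorded date of death.
            flags.append(("service_after_death", claim["id"]))
        payers[(claim["patient"], claim["service"], claim["day"])].add(claim["payer"])
    # Same patient, service, and day submitted to more than one payer.
    for (patient_id, _service, _day), who in payers.items():
        if len(who) > 1:
            flags.append(("multiple_payers", patient_id))
    return flags

flags = flag_suspicious(patients, referrals, claims)
print(flags)
```

Each rule needs information from more than one record type (a referral plus the patient's mobility, a claim plus the patient's date of death), which is why the article frames fraud detection as a problem of connecting data rather than filtering a single table.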

Graph analytics can be used to connect the dots in these situations while finding new dots and putting them under surveillance, preventing fraud before it happens, instead of incurring high costs to reverse it. It’s another instance of speed making measurable differences in outcomes, ones that preserve resources for administering patient care that’s accurate, effective, and successful.

On a national scale, the numbers are humbling: trillions of records to analyze, billions of dollars to save. No one entity controls or is responsible for all of it, and that only makes improvements that much harder to achieve. The promise, however, is that all that data, when examined with graph analytics, can be rapidly understood, correlated, and turned into multiple forms of prevention, consistently accurate treatment, and, ultimately, cures.

The graph of life
With graph analytics, there are clear advantages that are unavailable with any other technology.

  • Virtually immediate access to any source or type of data, regardless of its structure, without the need for homogeneity.
  • The integration of far more data to ensure that every relevant relationship can be discovered.
  • Query performance that is orders of magnitude faster than traditional queries run on commodity hardware and that can provide accurate in-office or bedside diagnoses and effective treatment recommendations practically on-demand.
  • Scalability and extensibility that accommodate dynamic updates to data and produce updated query results in real time.

These characteristics can transform information into action and, in the process, improve the outcomes of treatment and the survival rate of more and more patients.

David Anstey is the Global Head of Life Sciences at YarcData.