Christopher G. Chute, M.D., Dr. P.H.

Professor of Medical Informatics

Mayo Clinic

Big Data meets Healthcare: The case for comparability and consistency

Friday, November 22, 2013, 12:00PM

Packard Lab Room 466

Abstract: The well-known phenomenon of “information explosion” has affected virtually all areas of human enterprise, and healthcare is no exception.  While one might quibble about whether more information is actually being created, there is no disagreement that vastly more information is being electronically captured and stored.  Latent within the proliferation of such machine-readable archives lie previously impractical metrics, capabilities for linkage and association, and ultimately new knowledge.  The overused moniker of “big data” is applied to the rise of vast, potentially federated data sources, analytic methods for their interpretation, and emergent findings.  Despite this imprecision, most observers agree that something new and different is emerging from the opportunistic mining of disparate data on an unprecedented scale.

Examples of impressive inferences from big data abound in finance, marketing, education, social sciences, and economics.  More focused “big science” opportunities are self-evident in astronomy and physics, arguably including the discovery of the Higgs boson (which really was inferred from perturbations observed across exabytes of experimental particle-accelerator data).  In biology and medicine the sweet spot has historically been the human genome, where genotype-phenotype associations emerge from “genome-wide association studies” done at massive scale – more so in the present era of whole-genome sequencing.

The promise of best-evidence discovery, comparative effectiveness research, new outcomes analyses, adverse event discovery, and improved clinical care in general that might emerge from big-data methods applied to large, federated, clinical data repositories is intriguing.  There is “gold in them hills,” and the potential benefits of well-conducted data mining must not be lightly dismissed.

However, caution must dominate otherwise unfettered analyses of clinical information, as skewed, biased, spurious, or otherwise “wrong” answers can have serious adverse consequences.  While most of us are quite content to have a target answer appear “on the page” of a Google search result, having the right answer “on the list” but not chosen for a healthcare intervention may be interpreted as malpractice in some litigious countries – not to mention the likely sub-optimal outcome for the patient.  Clinical decision support resources may recommend a spectrum of options to a clinician, who presumably has the responsibility of synthesizing such advice and selecting the optimal path, though few would dispute that the amount of information and the complexity of its interactions long ago exceeded the unaided human capacity for cognition, reliable processing, or well-balanced interpretation.

The more insidious risk of blindly applying big-data methods to large clinical repositories is the underlying heterogeneity of clinical data representation, in both syntax and semantics.  Syntactically, recognizing for example “heart disease” in a patient’s record may classify that patient into algorithmic risk groups, though if that rubric is nested under an information structure containing family history information, that risk assignment may well be inaccurate.  Semantically, if one group of patients is found with “renal cancer” and a separate group is found with “kidney cancer,” no amount of big-data inferencing will reconcile their similarity absent an ontological assertion that these categories are synonymous.  The risk of misclassifying clinical data is substantial, and more so in vast databases managed through conventional big-data methods.

Comparable and consistently represented clinical information, achieved either at entry or through normalization to a canonical form, remains a necessary step before big-data methods can be meaningfully or safely applied to clinical data repositories.

Bio: Christopher G. Chute, M.D., Dr. P.H., established the Division of Biomedical Informatics at Mayo Clinic, overseeing a program of applied research and development focusing upon clinical and genomic data sources, management, standardization and interpretation. He was division chair for 20 years, and is presently section head of Medical Informatics and Professor of Medical Informatics.

© 2014-2016 Computer Science and Engineering, P.C. Rossin College of Engineering & Applied Science, Lehigh University, Bethlehem PA 18015.