Reflections from my time with the Genomic Data Commons
I have spent the past 13 years at the University of Chicago, most of that time as a faculty member in cancer research. From December 2013 through January 2017 I spent my days (and a fair number of nights) working with the University of Chicago’s Center for Data Intensive Science. During that time, I had the privilege of being integrally involved in what I view as a truly transformational national effort: developing a purpose-built private object storage cloud for the nation’s cancer genomic data. This is the National Cancer Institute’s Genomic Data Commons, or GDC. The team at the University of Chicago architected and built the GDC starting in 2013, and we launched on June 6, 2016 with none other than Vice President Joe Biden. With forward-looking technology, scale and use cases, we were in a position to make the project a centerpiece of the Cancer Moonshot Initiative. At launch, the GDC became the largest repository of harmonized cancer genomic data on Earth. As I depart the University of Chicago and leave this amazing project standing on firm ground, I find myself reflecting on the larger picture and how much it informs approaches to developing Cloud systems.
What is the NCI-Genomic Data Commons?
The GDC houses all legacy human cancer genomic data, along with the associated clinical and biospecimen metadata, from a large array of NCI-funded studies, and it will house all future data of this kind as well. The initial tranche of over five petabytes of data from roughly fifteen thousand patients contains genomic data that provides a detailed window into the genetic drivers of cancer. All this data was cleaned, processed, standardized and indexed on ingest to provide researchers in America and around the world with the highest quality pre-computed genomic information in an advanced searchable system. Some fifteen million core-hours of compute went into this process, effectively removing this computational burden for tens of thousands of researchers and vastly expanding the effective pool of researchers able to work with this data. As the GDC grows in the future, the power of this data set will only increase, and democratizing access will inevitably accelerate the pace of discovery leading to new and better cures.
What does Cloud enable?
With the GDC we had the chance to ask the question: What does cloud enable? This is a question that many organizations increasingly need to ask and think through carefully. Making data available across an organization in a transparent, searchable and accessible fashion opens up potential for transformational discovery and new insight. Well-developed index and metadata services, together with a flexible and powerful API, set the stage for applying data sets far beyond the limited initial scope for which those data were collected. Scalable object storage and the flexibility to deploy on-prem or to commercial cloud present new opportunities in terms of speed of deployment and almost limitless potential to scale. With such parameters, cloud is not just another environment. Rather, it is a chance to rethink approaches and discard old paradigms. It is a chance to develop systems that allow your organization to move faster, to think and act smarter, and to accelerate self-awareness.
Rethink with Cloud
Many of the lessons of the GDC apply much more generally. At a basic level, the GDC is about three things: 1) Opening up siloed data and applying compute to standardize it into a data lake. 2) Democratizing access with searchable, indexed metadata and a well-designed API. 3) Combining these to accelerate discovery: in the case of the GDC, to speed research in the short term and transform healthcare in the long term. Many groups I have spoken to over the years, be they corporate, government, or institutional, have pools of siloed data. The challenge is to rethink what they can do with Cloud:
• To create new data lakes fed by rivers of data that are cleaned, standardized and indexed on ingest.
• To develop robust APIs that allow data to be accessed and utilized transparently without concern for proximity.
• To combine orthogonal data sets and apply new analytics that drive discovery.
• To think through application of cognitive computing (e.g. machine/deep learning) to reveal new intuition.
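As a rough illustration of the first two points, here is a minimal Python sketch of what "cleaned, standardized and indexed on ingest" can look like at toy scale. It is not the GDC's implementation: the MetadataIndex class, the standardize helpers and the field names (primary_site, data_type) are all hypothetical, chosen only to show the pattern of normalizing records as they arrive and making them searchable through one small interface.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def standardize_value(value):
    """Trim and lower-case strings; leave other values unchanged."""
    return value.strip().lower() if isinstance(value, str) else value


def standardize(raw):
    """Normalize field names and values, dropping empty entries."""
    clean = {}
    for key, value in raw.items():
        value = standardize_value(value)
        if value in ("", None):
            continue  # skip fields that carry no information
        clean[key.strip().lower().replace(" ", "_")] = value
    return clean


@dataclass
class MetadataIndex:
    """Toy in-memory index (hypothetical): (field, value) -> record ids.

    Values must be hashable, since they are used as dictionary keys.
    """
    records: dict = field(default_factory=dict)
    inverted: dict = field(default_factory=lambda: defaultdict(set))

    def ingest(self, record_id, raw):
        """Clean a raw record, store it, and index every field on the way in."""
        clean = standardize(raw)
        self.records[record_id] = clean
        for key, value in clean.items():
            self.inverted[(key, value)].add(record_id)

    def search(self, **filters):
        """Return ids of records matching all filters (logical AND)."""
        if not filters:
            return set(self.records)
        matches = [
            self.inverted.get((key, standardize_value(value)), set())
            for key, value in filters.items()
        ]
        return set.intersection(*matches)


index = MetadataIndex()
# Two sources describing similar data with inconsistent conventions.
index.ingest("case-1", {"Primary Site": " Lung ", "Data Type": "WGS"})
index.ingest("case-2", {"primary_site": "lung", "data_type": "RNA-Seq", "notes": ""})

lung_cases = index.search(primary_site="Lung")
lung_wgs_cases = index.search(primary_site="lung", data_type="wgs")
```

Because both records are normalized on the way in, a single query finds them regardless of how each source spelled its fields. A real system would put the same idea behind a persistent index and a network API rather than an in-memory dictionary.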
Open-access = Open-source discovery
In many cases it is worth considering protocol-driven data life cycles that involve opening up your data to the world. This effectively makes discovery and innovation open source, opening the way to new discoveries and new efficiencies beyond the confines of the corporation or institution. Do not think of the Cloud only as a virtual storage and compute environment. Think in terms of how Cloud will let data transform your company or institution at the speed of disruption.