DNAdigest interviews Intel
Big Data Solutions is Intel's leading big data initiative, which aims to empower businesses with the tools, technologies, software and hardware for managing big data. Big Data Solutions is at the forefront of big data analytics, and today we talk to Bob Rogers, Chief Data Scientist, about his role, big data for genomics and his contributions to the BioData World Congress 2015.
1. What is your background and your current role?
I am Chief Data Scientist for Big Data Solutions. My mission is to put powerful analytics tools in the hands of every business decision maker. My responsibility is to ensure that Intel is leading in big data analytics in the areas of empowerment, efficiency, education and technology roadmap. I help customers ask the right questions to ensure that they are successful with their big data analytics initiatives.
I began with a PhD in physics. During my postdoc, I got interested in artificial neural networks, which are systems that compute the way the brain computes. I co-wrote a book on time series forecasting using artificial neural networks, which resulted in a number of people asking me if I could forecast the stock market. I ended up forming a quantitative futures fund with three other partners, which we ran for 12 years. That was my first experience with big data, since we were working with a large amount of historical and streaming tick-by-tick data from the markets.
In 2006, I was ready to move to a more personally fulfilling phase in my career, so I refocused on healthcare. I became product manager for a medical device that is the global gold standard for glaucoma care. This experience really helped me understand how the healthcare ecosystem works, and how challenging the obstacles to using data analytics in healthcare can be.
In 2009, the U.S. government did something very interesting: it began incentivizing physicians to use electronic health records (EHRs). The idea was to improve the efficiency and quality of healthcare delivery by liberating healthcare data. Healthcare data began to move from handwritten, paper silos to electronic silos. The problem was that EHR systems do not generally communicate with one another, so I co-founded a company, Apixio, with the idea that we would bring all the data together in the cloud and organize it in a way that would be useful to healthcare providers and healthcare delivery systems.
Along the way, we discovered that there are major issues with the quality of structured (coded) data in the EHR, and that the clinician’s text is the most important information in the clinical record. This led us to develop a Big Data analytics platform to analyze structured data, clinical text and scanned documents together to help understand the True State of the Patient.
Apixio’s work is ongoing, but in the meantime I had the opportunity to join Intel as Chief Data Scientist for Big Data Solutions in January. It’s been the best move of my career.
2. What is the most influential achievement in your career?
The development of the Apixio Big Data stack that has allowed us to reveal the True State of the Patient through advanced analytics is a very proud accomplishment. The key is that this view of the patient is computable, so that it can be used as the foundation for healthcare system optimization, care delivery improvement and disease management, all of which lead to better outcomes and superior healthcare experiences for patients.
3. What are you bringing to the BioData World Congress 2015? Do you have any expectations about it?
I bring the perspective of an outsider data scientist with a deep understanding of the clinical side of the equation. I am not a genomics expert, but I am passionate about the potential of supporting the life science community with the analytical tools of Big Data. I hope to come away from the event with some new relationships with individuals and companies that are driving the need for new analytics technologies in the life sciences.
4. How do you explain BIG DATA to lay people? What is the main power of BIG DATA, in your opinion?
Big Data is simply the ability to analyze any kind of information to yield robust, repeatable, actionable insights. In the past, there have been many obstacles to meaningful data analytics. For example, huge data sets were difficult or impossible to process in a single analysis on a single computer. Textual and image data took too much computing time to analyze, or the results were too noisy to use. Streaming data was difficult to manage and impossible to respond to in real time. The technologies that allow us to overcome these obstacles are all part of the Big Data world.
It’s nice for data scientists to be able to analyze all these different data types, but the real question is, “What does that get me?” Here’s an example. Imagine that I need to create a search engine for medical data that understands which clinical information is relevant to a specific disease. I need to know how millions of clinical terms are related to one another in the context of the disease in question. From an analysis point of view, this is a large network or graph, in which each node is a term and the link between two terms measures their relationship. To use such a system to search for relevant data requires access to the entire graph at once, since you don’t know ahead of time what terms might be relevant to the search. This means immediate access to 1 million times 1 million, or 1 trillion term-term relationships. If this isn’t a Big Data problem, I don’t know what is!
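To make the scale concrete, here is a minimal sketch (not Apixio's actual system; all names are hypothetical) of why such a term-term graph is stored sparsely: a dense 1 million by 1 million matrix would mean 1 trillion cells, far too many to hold in memory, whereas storing only the observed relationships keeps lookups fast and the footprint small.

```python
# Sparse term-term relationship graph: only observed links are stored,
# avoiding the 1-trillion-cell dense matrix described above.
from collections import defaultdict

class TermGraph:
    """Each node is a clinical term; each edge weight measures relatedness."""
    def __init__(self):
        self.edges = defaultdict(dict)  # term -> {related_term: strength}

    def add_relationship(self, term_a, term_b, strength):
        # Store symmetrically so either term can serve as the query.
        self.edges[term_a][term_b] = strength
        self.edges[term_b][term_a] = strength

    def related(self, term, top_n=5):
        # Return the strongest relationships for a query term.
        neighbors = self.edges.get(term, {})
        return sorted(neighbors.items(), key=lambda kv: -kv[1])[:top_n]

g = TermGraph()
g.add_relationship("myocardial infarction", "troponin", 0.92)
g.add_relationship("myocardial infarction", "chest pain", 0.85)
g.add_relationship("chest pain", "dyspnea", 0.40)

print(g.related("myocardial infarction"))
# -> [('troponin', 0.92), ('chest pain', 0.85)]
```

A production system would of course shard such a graph across machines, but the sparse-adjacency idea is the same.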
At DNAdigest, we are specifically interested in genomic data. Therefore, a few questions on this topic:
5. In your view, how much genomic data is needed to make statistical correlations in the sense of ‘BigData analysis’?
The amount of data you need to make a meaningful correlation depends a lot on what kind of association you are trying to establish. Connecting a single SNP with an outcome is confounded by the sheer size of the genome and the intrinsic variability of our current measurement methods. Ultimately, depending on the condition you are looking at, only a percentage of your sample population will have the condition, so you will need to start with a large population to find a useful cohort. Whether the results hold up then depends on regular old statistics.
Where Big Data can help is to speed the search for candidate patients for a specific study and to speed the identification of interesting associations that arise from more complex genomic variants.
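The cohort-size point above can be illustrated with a hedged back-of-the-envelope calculation (my own illustration, not a method from the interview): if a condition has prevalence p, the expected number of cases in a population of n people is n × p, so reaching a target cohort size requires screening roughly cohort_size / p individuals.

```python
# Back-of-the-envelope estimate of the starting population needed to
# yield a cohort of a given size, given the condition's prevalence.
import math

def required_population(cohort_size, prevalence):
    """Population whose expected number of cases reaches cohort_size."""
    return math.ceil(cohort_size / prevalence)

# A condition affecting 0.5% of people: a 1,000-patient cohort
# requires screening roughly 200,000 individuals.
print(required_population(1000, 0.005))  # -> 200000
```

This ignores sampling variability (a real study would pad the population so the cohort is reached with high probability), but it shows why rare conditions demand very large starting populations.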
6. What do you see as the biggest challenges for producing genomic data sets that can be analysed in this way?
I see three major challenges:
1. The inherent variability in our genetic measurement methods increases the number of data points required for any measurement to be believable.
2. As we learned at Apixio, clinical data requires significant infrastructure to analyze. Specifically, data comes from multiple sources and structured data must be denoised to make it reliable. In fact, many of the interesting facts about a patient are mentioned in clinical text, but not coded (such as the patient’s appearance or the presence of noisy breathing), so this information must be extracted from text with NLP and text mining tools.
3. The ability to aggregate genomic data across multiple institutions. Genomic data is expensive to collect and is seen as a valuable asset to the research and clinical organizations that collect it. But as we’ve already discussed, to infer useful information from it, we need as many data points as possible, so data sharing is crucial. One solution is a secured, federated approach as has been demonstrated by Intel and Oregon Health and Science University with the Collaborative Cancer Cloud project.
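The text-mining step in point 2 can be sketched in miniature. This is a toy keyword-matching illustration of my own (real clinical NLP pipelines add negation detection, concept normalization and much more); the vocabulary below is hypothetical.

```python
# Toy sketch: surfacing uncoded clinical facts (e.g., appearance,
# noisy breathing) from free-text notes via phrase matching.
import re

# Hypothetical vocabulary mapping trigger phrases to concept labels.
CONCEPTS = {
    r"noisy breathing|stridor|wheezing": "abnormal breath sounds",
    r"pale|cachectic|well[- ]nourished": "patient appearance",
}

def extract_concepts(note):
    """Return concept labels whose trigger phrases appear in the note."""
    found = []
    for pattern, label in CONCEPTS.items():
        if re.search(pattern, note, flags=re.IGNORECASE):
            found.append(label)
    return found

note = "Pt appears pale; noisy breathing noted on exam."
print(extract_concepts(note))
# -> ['abnormal breath sounds', 'patient appearance']
```

Even this crude approach makes the point: the facts it recovers exist nowhere in the structured, coded portion of the record.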
7. How do you see your work supporting best practices for data governance and data management?
I work with Intel legal and customers to define key use cases and the appropriate data retention policies for those use cases. We also look at policies for data protection and stewardship. A key area of work for me is to define metrics for identifying the value of data so that appropriate business models can be built around it. Guidelines for uses of data, and legal language to support these uses, are a key output of these efforts. More broadly, Intel has developed some key technologies around data security and around authentication and access control for data, so the friction introduced by data security is rapidly diminishing.
Are you part of a project that facilitates data sharing for genomics research?
Would you like to be featured on our blog?