Information management: to federate or not to federate
This is a guest post by Yasmin Alam-Faruque, member of Eagle Genomics’ Biocuration team.
Originally published on eaglegenomics.com
Information management is a key organisational activity that concerns the acquisition, organisation, cataloguing and structuring of information from multiple sources and its distribution to those who need it. From a scientist’s perspective, experimental results are the most important pieces of information that are analysed and interpreted to make new biological discoveries. Unless you are the one generating the results, it is not always an easy task to find and gather all other relevant datasets and documents that you need for further comparison and analyses.
What is the current approach? Currently, sharing of data between researchers is a manual and complex process, which causes inefficiency since a significant fraction of researcher time is spent on this activity. New high-throughput technologies generating huge datasets are compounding the problem. We argue that new information management approaches based on data federation can help address this problem, thus leading to quicker analyses and discovery of new biological insights.
Data federation is a form of data virtualisation, whereby data is queried from distinct databases without copying or moving the original data from its source. It combines result sets from across multiple source systems and gives users fast access to third-party data for further organisation and analyses. This avoids the trouble and expense of full data integration or data warehouse creation, also known as data consolidation, which requires large computer storage capacity as well as standardisation and optimisation of the source data.
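To make the idea concrete, here is a minimal sketch in Python of a federated lookup. It assumes two independent source systems, modelled as separate in-memory SQLite databases; the table names, column names and sample values are hypothetical and chosen only for illustration. The point is that each source is queried in place and only the result sets are combined at request time.

```python
import sqlite3


def make_source(table, schema, rows):
    """Create a stand-alone in-memory database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({schema})")
    placeholders = ",".join("?" * len(rows[0]))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return conn


# Two hypothetical sources that are never merged into one store.
genomics_db = make_source(
    "variants", "sample_id TEXT, gene TEXT",
    [("S1", "BRCA1"), ("S2", "TP53")],
)
clinic_db = make_source(
    "diagnoses", "sample_id TEXT, diagnosis TEXT",
    [("S1", "breast cancer"), ("S3", "asthma")],
)


def federated_lookup(sample_id):
    """Query each source where it lives and combine the result sets."""
    genes = [g for (g,) in genomics_db.execute(
        "SELECT gene FROM variants WHERE sample_id = ?", (sample_id,))]
    diagnoses = [d for (d,) in clinic_db.execute(
        "SELECT diagnosis FROM diagnoses WHERE sample_id = ?", (sample_id,))]
    return {"sample_id": sample_id, "genes": genes, "diagnoses": diagnoses}


print(federated_lookup("S1"))
```

A real federation layer would add query planning, authentication and network access, none of which is shown here.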
In this blog we look at federation and suggest where it might be useful in life science R&D.
New Technology & Information Generation in Life Sciences: New technologies for genomics, proteomics and metabolomics are increasingly used in the quest for personalised medicine. These technologies are responsible for the generation of vast, often terabyte-scale and varied datasets. Clinical datasets will also be of varying types since they will have been derived from various commercial medical testing centres using different technologies with associated software packages. Genomic and clinical datasets must be integrated to provide systems-level profiling in the drive towards new biological discoveries in understanding human health and disease.
These advanced methods are also rapidly being applied to drug discovery. The increasing insight into the ways that genes influence biological pathways and human disease, together with the abundance of genetic data, can be used in computational and systems biology approaches to identify and select genetically supported drug targets and indications, which in turn will have a huge impact on the successful clinical development of new drugs.¹,²
Current Challenges in Sustainability & Creation of Data Silos:
IT organisations typically try to consolidate these vast and varied datasets into structured, central warehouses and content management platforms (e.g. tranSMART, Documentum) to provide single resources for convenient analysis. This is already a time-consuming and expensive process, and it requires data standards and formats to be agreed in advance before anything can be loaded into the centralised resource. In the modern era, datasets have grown so large and change so quickly that centralisation is no longer sustainable. Valuable datasets increasingly languish in silos distributed throughout large organisations and their collaborators, often inaccessible or with restricted access.
Federation Enables Better Use of Information: Data federation represents an alternative paradigm to consolidation. Here, the original disconnected and heterogeneous data files, documents and databases remain in their source locations, which act as nodes in a federated network system (Figure 1). This also enables medical centres and research institutes to maintain their own datasets. The source locations are referenced from a central interface and are supplemented by standardised descriptions of the experimental context (metadata). Researchers thus acquire a secure and uniform system to access the necessary information across a variety of heterogeneous data resources. The interface is loaded either manually (via biocuration) or automatically (via systems integration).
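The central interface described above can be pictured as a catalogue that stores only references and metadata, never the data itself. The following Python sketch is one possible shape for such a catalogue; the class names, metadata keys and locations are all hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    dataset_id: str
    location: str   # where the data lives; the data itself is not copied
    owner: str      # the institute that keeps maintaining the dataset
    metadata: dict = field(default_factory=dict)  # standardised experimental context


class FederatedCatalogue:
    """Central interface holding references to datasets at their source nodes."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Called by a biocurator (manual) or an integration job (automatic).
        self._entries[entry.dataset_id] = entry

    def search(self, **criteria):
        """Return entries whose metadata matches every given key/value pair."""
        return [
            e for e in self._entries.values()
            if all(e.metadata.get(k) == v for k, v in criteria.items())
        ]


catalogue = FederatedCatalogue()
catalogue.register(CatalogueEntry(
    "ds-001", "sftp://node-a.example.org/rnaseq/", "Institute A",
    {"assay": "RNA-seq", "species": "Homo sapiens"},
))
catalogue.register(CatalogueEntry(
    "ds-002", "https://node-b.example.org/api/proteomics", "Clinic B",
    {"assay": "proteomics", "species": "Homo sapiens"},
))

for hit in catalogue.search(assay="RNA-seq"):
    print(hit.dataset_id, hit.location)
```

A search returns where a dataset lives and how it was generated; retrieving the data is then a separate, access-controlled step against the owning node.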
Data Federation Systems: Examples of federated systems include modern virtual data warehouses such as BioMart and Cisco Composite which are facilitating collaboration, discovery, faster querying and analysis of different data types, swift decision-making, and more efficient business operations.
Impact on Healthcare: Efficient information management is even more essential in the healthcare industry, where large sample sizes and their accompanying clinical and research datasets are indispensable for studying human diseases. Genomic data linked to extensive phenotype and health information will be collected on millions of individuals within a few years, from clinical medicine and also research-driven biobanks³. However, in the real world, access to data in the medical clinic is often restricted, whereas data within biomedical research is generally far more open, thus presenting an important case where federation becomes critical for advancing our understanding of human health.
Better Integration of Genomic Information: Next-generation sequencing (NGS) is breaking down the barriers between research and the clinic, as genomic and clinical data are responsibly integrated to look for patterns of health and disease towards the transformation and personalisation of medicine. The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine. It brings together over 300 leading institutions within healthcare, research, disease advocacy, life science and information technology enterprises and is working to alter the current reality where data are kept and studied in silos, and tools and methods are non-standardised and incompatible. “A federated approach to data sharing and collaboration will enable secondary use of clinical data for research, increasing our understanding of global and cosmopolitan genetic diversity and environmental impacts on disease, and will allow long-term engagement with the clinical community,” says Dr. Ewan Birney, Director of the European Bioinformatics Institute (EMBL-EBI).
Data Federation and Patient Health Records: Federation is becoming a crucial reality within the healthcare enterprise itself, as it progresses towards adopting electronic health record (EHR) systems, both in the UK and USA⁴. The plummeting cost of whole human genome sequencing is enabling this data to be combined with individual health records and test results from clinical and genetic investigations in an EHR. Federated EHRs will be of huge benefit not only to physicians and researchers but also to the patients themselves, supporting them in managing their own health. This can be demonstrated by the adoption of Datawell, an innovative informatics platform which enables health data to be shared, by The Greater Manchester Academic Health Science Network (GM AHSN).
Data collected for routine care by EHRs would ideally be used in the native format, but in reality, every brand of EHR generally stores data in a unique, proprietary format, which needs to be extracted for meaningful analysis in another software system. While federation allows secure access to each medical health centre, on-the-fly translation of EHR data into established data standards enables cost-effective use of the datasets.
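As a rough illustration of on-the-fly translation, the Python sketch below maps records from two made-up EHR formats into one shared schema at read time. The vendor names, field names and target schema are assumptions for this example only; a real deployment would map to an established standard rather than an ad hoc dictionary.

```python
from datetime import date


def from_vendor_a(record):
    """Hypothetical vendor A: ISO dates, spelled-out sex."""
    return {
        "patient_id": record["PatientID"],
        "birth_date": date.fromisoformat(record["DOB"]),
        "sex": record["Sex"].lower(),
    }


def from_vendor_b(record):
    """Hypothetical vendor B: DD/MM/YYYY dates, single-letter sex codes."""
    day, month, year = (int(part) for part in record["dob"].split("/"))
    sex_codes = {"M": "male", "F": "female"}
    return {
        "patient_id": record["pid"],
        "birth_date": date(year, month, day),
        "sex": sex_codes.get(record["gender"], "unknown"),
    }


# One translator per source format; new sources only need a new entry here.
TRANSLATORS = {"vendor_a": from_vendor_a, "vendor_b": from_vendor_b}


def translate(source, record):
    """Translate a native record into the shared schema when it is read."""
    return TRANSLATORS[source](record)


print(translate("vendor_a", {"PatientID": "A-1", "DOB": "1980-02-03", "Sex": "Female"}))
print(translate("vendor_b", {"pid": "B-7", "dob": "03/02/1980", "gender": "F"}))
```

Because translation happens per request, the source systems keep their native formats and nothing needs to be re-exported in bulk.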
Such systems can also enable the identification of specific patient cohorts and biospecimens for in-depth research projects and become an essential asset for clinicians in order to provide the best personalised medical treatment, ensuring the well-being of their patients.
Future of Collaboration: The National Health Service in the UK plans to sequence 100,000 genomes from its patients by 2017 via Genomics England⁵. With appropriate consent, anonymisation, and control, genome sequence information will be linked to patients’ EHRs and will be made available to researchers. Hence, federation will become an essential means for collaboration between different organisations working together on large healthcare projects (e.g. GA4GH and the 100,000 Genomes Project), to address complex genetic questions in silico. Examples include identifying specific patient cohorts to study the health consequences of a particular gene mutation, or investigating the possible safety issues for a drug targeting a specific gene by studying individuals who carry genetic variants whose effects mimic, or are similar to, those of the proposed drug.
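One way such an in silico cohort question could be asked across a federation is for each node to evaluate the query locally and return only a count. The Python sketch below illustrates that pattern; the node names, record layout and gene symbols are hypothetical, and this is not a description of how GA4GH or Genomics England implement their services.

```python
class Node:
    """A federation node; patient-level records never leave this object."""

    def __init__(self, name, records):
        self.name = name
        self._records = records

    def count_matching(self, gene, diagnosis=None):
        """Count local patients carrying a variant in `gene`, optionally with a diagnosis."""
        return sum(
            1 for r in self._records
            if gene in r["variants"]
            and (diagnosis is None or r["diagnosis"] == diagnosis)
        )


def cohort_size(nodes, gene, diagnosis=None):
    """Ask every node the same question and collect only aggregate counts."""
    return {node.name: node.count_matching(gene, diagnosis) for node in nodes}


nodes = [
    Node("clinic_a", [
        {"variants": {"PCSK9"}, "diagnosis": "hypercholesterolaemia"},
        {"variants": {"BRCA1"}, "diagnosis": "breast cancer"},
    ]),
    Node("biobank_b", [
        {"variants": {"PCSK9", "TP53"}, "diagnosis": None},
        {"variants": {"PCSK9"}, "diagnosis": "hypercholesterolaemia"},
    ]),
]

print(cohort_size(nodes, "PCSK9"))
print(cohort_size(nodes, "PCSK9", diagnosis="hypercholesterolaemia"))
```

Returning counts rather than records lets a researcher judge whether a cohort is large enough before any request for controlled access to the underlying data is made.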
Data Security: Even though clinical and genomic datasets are anonymised, security is of paramount importance. In a federated system, access and transfer of data can be controlled and encrypted, and hence security can be more robust than within a consolidated database. Controlled access to federated data is via a central interface, and since the original data remains in its native storage location, retrieved information that is accidentally deleted or altered can be rapidly and seamlessly recovered from the original silo. Modern software platforms, such as Eagle Genomics’ eaglecore, are designed to enable federation of large datasets, providing a central, safe and secure avenue to accelerated data discovery, integration, sharing and collaboration.
Questions for the future: What are your thoughts on challenges to better information management in life science R&D? What are the enablers? What can we expect for the future?
- Nat Genet. (2015); http://dx.doi.org/10.1038/ng.3314
- Biomed Res Int. (2013); http://dx.doi.org/10.1155/2013/742835
- PLoS Biol. (2015); http://dx.doi.org/10.1371/journal.pbio.1002216
- Nature Biotechnology (2015); http://dx.doi.org/10.1038/nbt.3180
- PLoS Biol. (2015); http://dx.doi.org/10.1371/journal.pbio.1002216