DNAdigest interviews Transcriptomine
This week I would like to introduce you to Dr Neil McKenna, who is a principal investigator of the Nuclear Receptor Signaling Atlas consortium. In the following Q&A session you will learn about Transcriptomine, a tool that gives the research community ready access to transcriptomic datasets: some background, future plans for improvement, and a step-by-step process for using it in your own research.
Dr Neil McKenna, principal investigator of the Nuclear Receptor Signaling Atlas consortium
1. Could you please give us an introduction to Transcriptomine?
Eukaryotic signal transduction involves small extracellular signaling molecules (ESMs) – hormones and growth factors, for example – and transcription factors (TFs), which bind DNA and regulate the expression of target genes. Transcriptomine is an effort to compile, organize and consistently annotate transcriptomic datasets involving ESMs or TFs, and to expose these to the research community so that they can make more effective use of them for their research.
2. What is your role in the project and how does your background support it?
Transcriptomine draws together the talents of a scientific curation and annotation team, with a strong background in signal transduction research, and a web development and information technology team. Financial support for Transcriptomine is provided by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Institute of Child Health and Human Development (NICHD). As project leader, I oversee the curation team and co-ordinate with the leader of the web and IT team, Dr. Lauren Becnel, to ensure that we stay on track with respect to the goals to which we committed when originally funded by NIDDK and NICHD.
In 1996 I joined the nuclear receptor laboratory of Dr. Bert O’Malley in the Department of Molecular and Cellular Biology in Baylor College of Medicine as a postdoctoral trainee. I worked on characterization of large cellular complexes of coregulators, a new class of signal transduction molecules that had been shown to be required for efficient regulation of gene expression by nuclear receptors, and which were being cloned in growing numbers on a seemingly weekly basis. During my time in the laboratory, expression microarrays garnered growing attention as a powerful alternative to traditional techniques of quantifying changes in gene expression levels, such as northern blotting or ribonuclease protection assay. Although the technology matured rapidly, and paper after paper demonstrated the power that it afforded researchers to survey changes in the expression of large numbers of transcripts in response to cell signals, there was relatively little attention paid to, number one, where and in what form the raw datasets and their associated metadata should be archived, and number two, how to give the research community ready access to these datasets so that they could ask their own questions of them. Looking beyond my postdoctoral training to my own career as a scientist then, I wanted to create an environment in which the average bench scientist could have access to this universe of data points and to use them to catalyze and guide their own experiments.
3. Who had the idea to create Transcriptomine and how did it start? What are its goals?
In 2002, the NIH funded the Nuclear Receptor Signaling Atlas (NURSA), a consortium of investigators in the field of nuclear receptor signaling. The goals of the consortium were twofold: number one, to generate discovery-driven datasets using transcriptomic and proteomic techniques, and number two, to distribute these to the research community through a freely accessible web site. I led the team that designed the website (www.nursa.org) and curated its scientific content. One of the earliest analysis tools that we (myself, Scott Ochsner and David Steffen) developed was Gene Expression Metasignatures (GEMS). GEMS took a group of microarray datasets that documented the transcriptomic response of MCF-7 cells to 17β-estradiol (17βE2) and gave the user a weighted probability that their gene of interest was regulated by 17βE2 in this cell line. GEMS was limited, however, because it required that all of the datasets had a similar experimental design. At one of our weekly meetings, David had the idea of building a resource that allowed anyone, no matter what they are working on, to determine whether their gene or genes of interest are regulated by any nuclear receptor signaling pathway, in any cell line or tissue. This resource is Transcriptomine.
4. Who can benefit from the Transcriptomine tool and is there any limit to who can use it? Where do the data originate?
Although expression microarray and RNA-Seq datasets collectively contain millions of data points that are potentially very useful to researchers to guide their research, it is often extremely difficult for the average bench researcher to ask basic questions of them. To compound matters, datasets are typically annotated in a haphazard manner (if at all) and are often squirreled away in file formats that are convenient for the authors to submit and for the journal to host, but that are informatically incomprehensible. Even when datasets are archived, they are still really only nominally available to most bench researchers. It’s like having a book in your hand but only very poor light to read it by. Most freely accessible databases (GEO, ArrayExpress) are designed as raw data repositories, not as user-friendly analysis tools. There are subscription-based tools available that have more user-oriented query interfaces but there are typically strict limitations on where these can be used, and by whom.
What Transcriptomine does is take these datasets out of their informatically opaque locations, supplement them with additional annotations and modern ontologies, and present them to end users to allow them to ask questions of the entire universe of data points in an intuitive manner. There are no limits to its use by anyone – these are for the most part data that have already been bought and paid for using taxpayer research dollars and, as such, ought to be made available without restriction and without anyone profiting from this process.
5. Can you walk us step by step into the process of searching with Transcriptomine?
To maximize its utility to a broad audience, the Transcriptomine user interface allows for flexible query construction. The form is designed with five sections, through which the user navigates to assemble their query of choice.
1. Gene. (i) Any Gene: returns any gene fulfilling the criteria in the query. (ii) Single Gene: as users enter a symbol, an auto-suggest function displays suggested symbols based on the text entered, along with the corresponding official symbol. (iii) Gene Ontology Term: to provide a platform for the exploration of higher level regulation of cellular biology by NR signaling pathways, users can specify a GO term (biological process, cellular component or molecular function) as a surrogate for a specific gene or list of genes. (iv) Disease Term: the NCBI Online Mendelian Inheritance in Man (OMIM) database (13) maps Entrez Gene IDs to human genetic diseases, and users can enter OMIM terms to identify NR regulation of genes involved in a particular disease of interest. (v) Gene List: finally, to accommodate “power” users wishing to query custom gene lists against the database, we provide for uploading of .CSV files of official gene symbols or Entrez Gene IDs.
2. Regulation. This subsection provides for filtering search results based on the amplitude (fold change), direction (induction or repression) and significance.
3. Regulatory Molecule. Transcriptomine documents regulation of gene expression by a broad variety of NR signaling molecules, including NRs, their ligands and coregulators, and users can specify their pathway of interest using this section.
4. RNA Source. Regulation of gene expression by NR signaling pathways is highly contextual, particularly with respect to tissue or cell line. Users can select their tissue or cell line of interest using this section.
5. Species. This section is used to specify results from human, mouse or rat.
Results are presented in the form of a table, with links to more detailed information on each fold change. Power users can download results as an Excel file for more detailed analysis.
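For readers who think in code, the five form sections can be pictured as filters applied to a table of fold changes. The following is only a rough sketch under stated assumptions, not Transcriptomine's actual implementation or API: the `FoldChange` and `Query` classes, the field names, the fold-change convention (above 1 is induction, below 1 is repression), the default significance cut-off and the placeholder rows are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class FoldChange:
    """One row of a results table (hypothetical layout)."""
    gene: str
    fold_change: float  # assumed convention: >1 induction, <1 repression
    p_value: float
    regulatory_molecule: str
    rna_source: str
    species: str


@dataclass(frozen=True)
class Query:
    """The five form sections; None means 'no restriction'."""
    genes: Optional[FrozenSet[str]] = None     # Gene (None = Any Gene)
    min_amplitude: float = 1.0                 # Regulation: amplitude
    direction: Optional[str] = None            # "induction" or "repression"
    max_p_value: float = 0.05                  # Regulation: significance
    regulatory_molecule: Optional[str] = None  # Regulatory Molecule
    rna_source: Optional[str] = None           # RNA Source
    species: Optional[str] = None              # Species


def amplitude(fold_change: float) -> float:
    """Size of the change regardless of direction (0.25 becomes 4.0)."""
    return fold_change if fold_change >= 1 else 1 / fold_change


def matches(query: Query, row: FoldChange) -> bool:
    """Return True if a row passes every section of the query."""
    if query.genes is not None and row.gene not in query.genes:
        return False
    if amplitude(row.fold_change) < query.min_amplitude:
        return False
    if query.direction == "induction" and row.fold_change <= 1:
        return False
    if query.direction == "repression" and row.fold_change >= 1:
        return False
    if row.p_value > query.max_p_value:
        return False
    for field in ("regulatory_molecule", "rna_source", "species"):
        wanted = getattr(query, field)
        if wanted is not None and getattr(row, field) != wanted:
            return False
    return True


def run_query(query: Query, rows: List[FoldChange]) -> List[FoldChange]:
    """Keep only the rows that satisfy the query."""
    return [row for row in rows if matches(query, row)]


# Placeholder rows with made-up values, not real experimental results.
ROWS = [
    FoldChange("GENE_A", 4.0, 0.001, "17βE2", "MCF-7", "human"),
    FoldChange("GENE_B", 0.25, 0.01, "17βE2", "MCF-7", "human"),
    FoldChange("GENE_C", 1.2, 0.2, "17βE2", "MCF-7", "human"),
    FoldChange("GENE_A", 3.0, 0.03, "17βE2", "liver", "mouse"),
]

# Any gene induced at least two-fold in human samples.
induced = run_query(
    Query(min_amplitude=2.0, direction="induction", species="human"), ROWS
)
print([row.gene for row in induced])
```

Treating each section as an optional filter mirrors the flexibility described above: leaving a section empty simply places no restriction on that dimension.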
6. Can you tell us a bit more about the NIH Big Data to Knowledge initiative and your role in it?
The current model of scientific publishing places datasets a very distant second to the actual research article. Journals or manuscript reviewers for the most part aren’t motivated to invest the effort to ensure that authors archive datasets in public repositories, and authors typically don’t do anything more than they absolutely have to in order to publish their papers. The result has been that millions of data points have been lost to posterity. BD2K is an initiative on the part of the NIH to alter the landscape of biomedical data management and to encourage the community to create new models for annotating, archiving, sharing and analyzing those large datasets whose creation it funds.
My group’s role in BD2K is twofold. Number one, we are making datasets more accessible to the bench scientist. Transcriptomic datasets in many ways are the poor relations of ‘omics datasets – even to the extent of being commonly incorrectly referred to as “genomics” datasets. In contrast to genomic sequence data or protein crystal structures, there has been little serious effort on the part of the cell signaling field to archive these datasets. My group is setting out to make them routinely accessible to scientists in the field, via their phone, laptop or tablet, without having to deal with log-in screens, forgotten passwords or subscriptions – they go straight to high quality, annotated data. Number two, we recognize that the scientific journal article is here to stay and that it is the primary medium through which many scientists access and consume scientific information. Accordingly, we are working closely with publishers to link relevant articles to interfaces on the NURSA website where datasets associated with those articles can be routinely mined side by side with the article. Our long term goal is to develop win-win models for publishers to incentivize their participation in BD2K and ensure they are part of the process moving forward.
7. Who are your collaborators and what is their part in creating/maintaining the Transcriptomine tool?
The Transcriptomine project is a long-standing collaboration with Dr. Lauren Becnel, Director of Biomedical Informatics at the Dan L. Duncan Cancer Center in Baylor College of Medicine. Lauren’s group draws upon substantial experience in building user-friendly web tools for both the clinical and basic research communities and, frankly, the project would not have been possible without her and the other talented people working in her group. These include the project manager, Dr. Yolanda Darlington, the software development group, including Apollo McOwiti and Wasula Kankanamge (and the recently departed Chris Watkins), as well as the backend team, Mike Dehart and Jane Qu. On the scientific side of things, Dr. Scott Ochsner has been involved with Transcriptomine from the start and with the wider NURSA project for even longer, and has an intimate knowledge of the analysis of differential gene expression datasets.
8. What do you plan for the future development of Transcriptomine?
Funding is tight at the minute and as a result it hasn’t been possible to expand the group to do everything with Transcriptomine that we would have liked. Having established proof of principle, we would like to extend our coverage to other ESMs and TFs – signaling pathways do not act in isolation and it’s important for investigators to be aware of vertical integration – crosstalk – between different signaling conduits in a specific cell type. Integration with ChIP-chip and ChIP-Seq data will be important as well to increase investigator confidence in regulation of a specific gene by a given signaling pathway. Visualization is also an important goal for us – you can have the most accurate, well annotated data in the world but there will always be an impediment to its adoption if it lacks interfaces that allow users to visually integrate information.