The ContentMine & Linguamatics Hackday – Hacked Open

On Friday 11th December 2015, a mixture of 30 students and professionals came together in the Garden Room at EBI-EMBL on the Wellcome Trust Genome Campus in Hinxton, to explore a new text-mining tool for online journals. Before we delve into The ContentMine Hackday itself, here is a quick ‘who’s who’.

ContentMine are an open-source project funded by the Shuttleworth Foundation and are the minds behind a new scientific literature fact extraction tool. Together with DNAdigest they co-organised the Hackday.

For those of you who are new to DNAdigest, we are a charity run by volunteers and promote the efficient sharing of genomics data to advance scientific research. This involves not only the technical challenges of handling big data but also platforms which allow easy searching and sharing of data, and the ethical and legal implications of doing so.

Finally, Linguamatics were the generous sponsors of the Hackday itself. Linguamatics offer the world’s leading text mining platform, and their software uses advanced Natural Language Processing to help individuals and organisations make sense of ever increasing amounts of unstructured big data.


The Hackday

After short introductions from Fiona Nielsen (Founder of DNAdigest) and Peter Murray-Rust (of ContentMine) and a quick trip for much needed caffeine, all participants downloaded a virtual machine and one by one began loading the ContentMine software. While Peter gave a live demo of the ContentMine tool, we quickly learned about its various uses. We started by looking into species and searching all journals for references to a particular keyword (and learning such facts as Boa Constrictor being the only species whose Latin name is the same as its English name). Shortly after, we learned about developing more complex search queries, including regular expressions, which involved using symbols such as ‘[ ]’, ‘{ }’, ‘+’ and ‘\s’ to create a tighter filter and thus produce more relevant search results. For instance, [A-Z] matches any capital letter and [0-9] matches any digit.
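To make those symbols concrete, here is a minimal sketch using Python's standard re module. It is not the ContentMine query syntax, and the sample sentence is invented purely for illustration.

```python
import re

# Invented sample sentence, not taken from any real journal.
text = "Boa constrictor was observed in 12 plots; sample AB 1234 was sequenced."

# [A-Z] matches a single capital letter, [0-9] matches a single digit.
capitals = re.findall(r"[A-Z]", text)
digits = re.findall(r"[0-9]", text)

# '+' means "one or more of the preceding item".
numbers = re.findall(r"[0-9]+", text)

# '{n}' means "exactly n of the preceding item"; '\s' matches whitespace.
codes = re.findall(r"[A-Z]{2}\s[0-9]{4}", text)

print(capitals)  # single capital letters
print(digits)    # single digits
print(numbers)   # runs of digits
print(codes)     # two capitals, a space, four digits
```

Each added symbol narrows what can match, which is the "tighter filter" idea from the demo.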

With a strong interest in genomic data sharing, DNAdigest sought to investigate how ContentMine could be used to identify references to human genomic datasets in online journals. Because of this, all hackday attendees got a quick tutorial on searching for ‘human genomic data’ journals. Using quotation marks, we compared searching for the exact phrase ‘human genomic’ with searching for the separate terms ‘human’ and ‘genomic’. It came as no shock that the ContentMine tool found thousands of journals containing both ‘human’ and ‘genomic’, and it highlighted how spending more time developing the ideal search query would produce a shorter list of matched journals and ultimately be the best use of time.
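The difference between a quoted phrase and separate terms can be sketched in a few lines of Python. The three "papers" and the helper names match_phrase and match_all_terms are hypothetical; this only illustrates the idea and is not how ContentMine runs its searches.

```python
# Hypothetical mini-corpus: three invented one-line "papers".
docs = {
    "paper_a": "We analysed human genomic data from 200 participants.",
    "paper_b": "Genomic variation in mouse models and human cell lines.",
    "paper_c": "A survey of plant genomic resources.",
}

def match_phrase(text, phrase):
    """True if the exact phrase appears (case-insensitive)."""
    return phrase.lower() in text.lower()

def match_all_terms(text, terms):
    """True if every term appears somewhere in the text (case-insensitive)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

phrase_hits = [name for name, text in docs.items()
               if match_phrase(text, "human genomic")]
term_hits = [name for name, text in docs.items()
             if match_all_terms(text, ["human", "genomic"])]

print(phrase_hits)  # papers containing the exact phrase
print(term_hits)    # papers containing both words anywhere
```

The separate-terms search also picks up paper_b, where the two words are unrelated, which is the same effect we saw at much larger scale.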

After some much needed food, the participants were invited to suggest how they would like to use (or could envisage using) the ContentMine tools for their own purposes. From this brainstormed list the room was split into several groups, each with one unique user case scenario to investigate. It was here that DNAdigest really drilled down into the tool and whether it could be used to identify references to datasets. Our main goal was to see if these references could be identified, extracted and then put into a machine readable database to enable people to easily search and access data from journals. To create a more specific user case we decided to look for journals which referenced both diabetes and neuropathy, and then attempted to find references to datasets within that list of matched journals.


The Conclusion

It became clear that the biggest obstacle to using ContentMine to locate datasets was that we were not looking for a unique word or phrase. It is much easier to look for the words ‘human’, ‘diabetic’ and ‘neuropathy’ than for dataset references, each of which is unique. So we had to get clever and devise a search query which could handle variations in structure.

First we searched for and downloaded all the journals which had references to ‘Diabetic’ and ‘Neuropathy’. Then we began looking for the dataset references within this much shorter list. To better explain the challenges, here is an imaginary dataset link as an example: ‘TBGA4927’. Because we do not know the exact dataset reference structure (we are looking for any dataset), we needed to create a search query which could handle all the different variations. So we devised a query that would first look for four capital letters [A-Z]+[A-Z], then four digits [0-9]+[0-9]. Then we incorporated the word ‘accession’ to further filter the journals. To a degree, this search query did work, but unfortunately there seemed to be inconsistent patterns in some dataset references. Some contained a space, some a hyphen and some an extra digit or letter. We did not find many datasets on the first attempt, and we also had many false matches where the journal contained other references which fit the same search query. After several attempts we finally cracked the code and found a query which returned a journal, and we could tell from the preview of the text that it had a link to a dataset. After checking the journal online we found it to be exactly the research we wanted, on Diabetes and Neuropathy.
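As a rough sketch of this kind of query, the Python below compares the pattern as written above with a more tolerant one. Note that [A-Z]+[A-Z] followed by [0-9]+[0-9] strictly matches two or more letters and two or more digits rather than exactly four of each, which may be one source of false matches. The tolerant pattern, the find_accessions helper, the 60-character window and the sample text are all assumptions made for illustration; they are not the query we used on the day, and real accession formats vary between repositories.

```python
import re

# Pattern as described in the text: two or more capitals, then two or more digits.
loose = re.compile(r"[A-Z]+[A-Z][0-9]+[0-9]")

# Hypothetical tolerant pattern: four or five capitals, an optional space
# or hyphen, then four or five digits, bounded by word boundaries.
candidate = re.compile(r"\b[A-Z]{4,5}[\s-]?[0-9]{4,5}\b")

def find_accessions(text, window=60):
    """Keep candidates that have the word 'accession' shortly before them."""
    hits = []
    for match in candidate.finditer(text):
        context = text[max(0, match.start() - window):match.start()].lower()
        if "accession" in context:
            hits.append(match.group())
    return hits

# Invented sample text built around the imaginary reference 'TBGA4927'.
sample = (
    "Patients were enrolled under protocol ABCD1234. "
    "Data are available under accession number TBGA-4927 and accession TBGA 49271."
)

loose_short = loose.findall("AB12")  # the loose pattern accepts short codes too
accessions = find_accessions(sample)

print(loose_short)
print(accessions)
```

Requiring ‘accession’ nearby drops the protocol number, while the optional separator and the flexible lengths keep the hyphenated and spaced variants.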

So, for the DNAdigest group, we managed to find 1 dataset in 2 hours. Not ideal in terms of return on investment, but what really took the time was developing the search query. We noticed that other groups had similar problems and did not manage to finish their user case investigations either, but everyone agreed that the potential for such a tool was huge and that, once the correct search terms were identified, it would be easy for a researcher to locate several journals of interest quickly. So in short, for our user case scenario, the problem was not ContentMine but the inconsistent structure of dataset references.

After another tea and biscuit break we all presented our findings to the group. Fiona picked 3 user cases which stood out as being well thought out and significantly different from the rest. These groups all received DNAdigest T-shirt prizes. To round off the evening we had a quick final word from Linguamatics, DNAdigest and ContentMine and a brief opportunity to network.

All in all, the ContentMine Hackday was a very insightful day. We made lots of new connections and learned about a new tool which could help us fulfil our mission of ethical and efficient sharing of data for all.


Useful links:

Storify – check out our Storify of the hackday #CMDNAHack

Hackpad – contains all the notes for the hackday

During the final words of the hackday, Jane Reed from Linguamatics informed us that not only were Linguamatics recruiting but they were also hosting a Spring Text-Mining Conference on April 25 – 27 in Cambridge, UK.

