In biomedical R&D, researchers use text mining tools to extract and interpret facts, assertions and relationships from vast amounts of published information. Mining accelerates the research process, increases discovery and helps companies identify potential safety issues in the drug development pipeline. However, despite the many benefits of text mining, researchers face a number of obstacles before they even get a chance to run queries against the body of biomedical literature.
Content mining, machine learning, text and data mining (TDM) and data analytics all refer to the process of extracting information from machine-read material. Faster than a human possibly could, machine-learning approaches can analyze data, metadata and text content; find structural similarities between research problems in unrelated fields; and synthesize content from thousands of articles to suggest directions for further research. Given the continually expanding volume of peer-reviewed literature, the value of TDM should not be underestimated. Text and data mining is a useful tool for developing new scientific insights and new ways to understand the story told by the published literature. Continue reading “Unrestricted Text and Data Mining with allofPLOS”
We are excited to announce the launch of the Europe PMC Annotations API, which provides programmatic access to annotations text-mined from biomedical abstracts and open access full text articles. The Annotations API is part of Europe PMC’s programmatic tools suite and is freely available on the Europe PMC website: https://europepmc.org/AnnotationsApi. The exponential growth in scientific data and scholarly content cannot be addressed by conventional means of information discovery. Text mining offers a practical solution to scale information extraction and advance biomedical research. However, its application is still limited, partially due to the technical know-how needed to set up a text-mining pipeline. Nonetheless, even non-specialists can capitalize on text-mining outcomes. Making text-mining outputs openly available can enable a broad audience of researchers and developers to address current challenges in biomedical literature analysis. For that reason, Europe PMC has established a community annotation platform. It consolidates text-mined annotations from various providers and makes them available both via the Europe PMC website as text highlights using the SciLite application, and now programmatically, with the Europe PMC Annotations API. Continue reading “Harness the power of text-mining for biomedical discovery: introducing Europe PMC Annotations API”
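As a sketch of what working with the API's output might look like, here is a minimal example that tallies annotations from a hand-made JSON payload. The field names and values are illustrative assumptions modelled on typical annotation responses, not the exact live schema; consult the documentation at the URL above for the real thing.

```python
import json
from collections import Counter

# A hand-made sample modelled on the kind of JSON an annotations API returns;
# the field names here are illustrative, not an exact copy of the live schema.
sample = json.loads("""
{
  "articleIds": ["PMC1234567"],
  "annotations": [
    {"type": "Gene_Proteins", "exact": "TP53", "section": "Abstract"},
    {"type": "Diseases", "exact": "breast cancer", "section": "Abstract"},
    {"type": "Gene_Proteins", "exact": "BRCA1", "section": "Results"}
  ]
}
""")

def count_by_type(payload):
    """Tally text-mined annotations by their entity type."""
    return Counter(a["type"] for a in payload["annotations"])

print(count_by_type(sample))  # Counter({'Gene_Proteins': 2, 'Diseases': 1})
```

Grouping by entity type like this is often the first step before feeding annotations into downstream analysis.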
Each year, millions of books and journals are published. When researchers need to answer a question, how exactly do they find what they need? In the past, researchers gathered data through screen scraping, the process of capturing data from a website using a computer. Today, sophisticated tools allow text and data…
CCC has announced enhancements to RightFind® XML for Mining, a cloud-based solution enabling users to quickly identify and download a corpus of full-text XML articles from more than 50 publishers through a single source. With these new features, CCC enables users with no previous experience to harness the benefits of semantic enrichment, reducing manual processes and increasing efficiency in research.
At the end of last week, I was at a small workshop held by the EXCITE project on the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure, the task touches on many of the general challenges of making sense of the scholarly literature.
Begin aside: Yes, I did run a conference called Beyond the PDF 2 and have been known to tweet things like:
But there’s a lot of great information in papers, so we need to get our machines to read them. End aside.
You can roughly categorize the steps of reference extraction as follows:
- Extract the structure of the article (e.g. find the reference section)
- Extract the reference string itself
- Parse the reference string into its parts (e.g. authors, journal, issue number, title, …)
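As a toy illustration of the final parsing step, here is a regex-based sketch that splits one common reference layout into its parts. Real systems such as ParsCit and CERMINE use trained sequence models rather than a single hand-written pattern like this, which breaks the moment the layout varies.

```python
import re

# Toy parser for one common reference layout:
#   "Authors. Year. Title. Venue vol:issue, pages."
REF = re.compile(
    r"(?P<authors>.+?)\.\s+"
    r"(?P<year>\d{4})\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<venue>.+?)\s+"
    r"(?P<volume>\d+):(?P<issue>\d+),\s*"
    r"(?P<pages>[\d\u2013-]+)\."
)

def parse_reference(s):
    """Return named parts of a reference string, or None if it doesn't fit."""
    m = REF.match(s)
    return m.groupdict() if m else None

ref = ("Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. "
       "Citation-based bootstrapping for large-scale author disambiguation. "
       "Journal of the American Society for Information Science and Technology "
       "63:5, 1030-1047.")
print(parse_reference(ref)["year"])  # 2012
```

Even this toy version shows why the task is hard: abbreviated author initials, stray periods in titles, or a different venue format would all defeat the pattern, which is exactly why learned models dominate here.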
Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on the reference parsing step is pretty good (~0.9 F1), but the task gets harder when all steps are included.
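For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall, so the ~0.9 figure means both are high simultaneously. A tiny sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A parser that recovers 90 of 100 true references (recall 0.9) while 90 of
# its 100 predictions are correct (precision 0.9) scores F1 = 0.9.
print(round(f1(0.9, 0.9), 2))  # 0.9
```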
There were three themes that popped out for me:
- The reading experience
- Resources
- Reading from the image
The Reading Experience
Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve researchers’ ability to come to grips with the state of science. He positioned the field as one where we have the groundwork and are building enabling tools (e.g. search, management, policies), but there’s still a long way to go in building systems that give researchers real insight. As custodian of the ACL Anthology, he also spoke about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best Skype talks I have ever been part of. Slides are below, but you should check it out on YouTube.
Another example of improving the reading experience was David Thorne‘s presentation on some of the newer features being added to Utopia docs – a souped-up PDF reader. In particular, the work on the Lazarus project, which, by extracting assertions from the full text of articles, allows one to traverse an “idea” graph alongside the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately; you can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the EXCITE project’s citation lookup API.
Finally, on the reading experience front, Andreas Hotho presented BibSonomy, the social reference manager his research group has been operating for the past ten years. It’s a pretty amazing success: 23 papers, 160 papers using the dataset, 96 million Google hits, and ~1,000 weekly active users. Obviously, it’s a challenge running user-facing software from an academic group, but it has clearly paid dividends. The main takeaway I had in terms of reader experience is that it’s important to identify what types of users you have and how the information they produce can help or hinder its application for other users (see this paper).
Resources
The interesting thing about this area is the number of resources available (both software and data) and how resources are also an outcome of the work (e.g. citation databases). Here’s a listing of the open resources I heard called out:
This is not to mention the more general sources of information like CiteSeer, ArXiv or PubMed. It was also nice to see how many systems are built on top of other software. I was also happy to see the following:
An interesting issue was the transparency of algorithms and the quality of the resulting citation databases. Nees Jan van Eck from CWTS, developer of VOSViewer, gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles reviewed his work on author disambiguation for CiteSeerX, using an external source to evaluate that process. A pointer I hadn’t come across was the work by Jurafsky and colleagues on author disambiguation:
Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.
Reading from the Image
On the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursuing this.
I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and applying CNNs for semantic segmentation, with pretty good success.
We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion, as current state-of-the-art techniques already use quite a lot of positional information for training. This can be extracted from the PDF, and some systems already use the images directly. However, there’s a lot of fiddling needed to deal with PDF contents, so the image may actually provide a cleaner starting point. That, though, brings us back to the issue of resources and how to generate the necessary training data.
- The organizers set up a Slack backchannel, which was useful.
- I’m not a big fan of Skype talks, but they were able to get two important speakers that way, and they organized it well. When it’s the difference between having field leaders or not, it’s worth it.
- EU projects can have a legacy – Roman Kern is still using code from http://code-research.eu where Mendeley was a consortium member.
- Kölsch is dangerous but tasty
- More workshops should try the noon to noon format.
Hat tip: Paul Groth
Dario Gil from IBM Research talks about how machines can help with the discovery of new materials. He describes deep parsing of scientific papers.
“Deep parsing of one scientific paper using IBM Text Analytics takes about 30 seconds on a laptop. That means your laptop can analyze over 20,000 papers/week.”
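The arithmetic behind that claim is easy to check:

```python
seconds_per_paper = 30
seconds_per_week = 7 * 24 * 60 * 60  # 604800 seconds in a week

# Running one analysis at a time, non-stop, for a full week:
papers_per_week = seconds_per_week // seconds_per_paper
print(papers_per_week)  # 20160 — so "over 20,000 papers/week" checks out
```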
As a publisher, we want to meet the needs of researchers and academics who interact with our site to ensure that we offer the best user experience and functionality possible. With this in mind, we began a project to facilitate text and data mining on emeraldinsight.com, following requests from individuals and institutions in recent months.
So what is text and data mining, or TDM as it is often shortened to? Well, it is the analysis of large bodies of work by a machine, to try to identify trends that would not ordinarily be picked up through usual ‘human’ reading. For example, processing the data contained in a large collection of scientific papers in a particular medical field could suggest a possible link between a gene and a disease, or between a drug and an adverse effect – connections that a human could never piece together, even after reading thousands of articles.
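As a toy sketch of that idea, the snippet below counts gene-disease co-mentions across a made-up mini-corpus. The papers and entities are invented, and recognizing the entities in the text is assumed to have happened upstream; real systems work at a vastly larger scale and with statistical significance testing.

```python
from collections import Counter
from itertools import product

# Hypothetical mini-corpus: each entry lists the genes and diseases
# mentioned in one paper (entity recognition is assumed done upstream).
papers = [
    {"genes": {"BRCA1"}, "diseases": {"breast cancer"}},
    {"genes": {"BRCA1", "TP53"}, "diseases": {"breast cancer"}},
    {"genes": {"TP53"}, "diseases": {"lung cancer"}},
]

# Count how many papers mention each (gene, disease) pair together.
pairs = Counter()
for paper in papers:
    for gene, disease in product(paper["genes"], paper["diseases"]):
        pairs[(gene, disease)] += 1

# Pairs that co-occur most often are candidate links for a human to review.
print(pairs.most_common(1))  # [(('BRCA1', 'breast cancer'), 2)]
```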
With so much amazing content on our site, it was an obvious decision to enable this functionality. Hopefully doing so will spark further ideas and research, and perhaps even change the world! Okay, we are maybe getting a bit ahead of ourselves, but it is still a good thing that it is now available.
Having investigated a number of different options as to how we could do this, we decided to go with a solution that involved the use of CrossRef’s TDM facility. This meant adding additional data into current and future deposits with CrossRef, along with depositing a huge tranche of historical information. So far, we have provided data for over 200,000 articles, and this number will continue to grow over forthcoming weeks. We have also enabled access to the equivalent number of machine-readable files on our site.
Users wishing to mine the site are encouraged to inform us of their intention to do so, so they are not automatically blocked by our system. There are also the usual access restrictions in place, so a user will still have to be a subscriber to the content. But aside from those minor caveats, we encourage our users to use the facility and mine for that one diamond of information that is just waiting to be discovered.
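One lightweight way a mining client might identify itself to a publisher's systems is via a descriptive User-Agent header. This is a hedged illustration only: the contact details below are made up, and the request is built but never sent.

```python
import urllib.request

# Hypothetical example of a politely identified mining request. Including a
# contact address lets site operators reach out instead of blocking outright.
req = urllib.request.Request(
    "https://www.emeraldinsight.com/example-article",
    headers={"User-Agent": "ExampleMiner/1.0 (mailto:researcher@example.org)"},
)
print(req.get_header("User-agent"))
```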
We got some great feedback from reviewers on our new Sloan grant, including a suggestion that we be more transparent about our process over the course of the grant. We love that idea, and you’re now reading part of our plan for how to do that: we’re going to be blogging a lot more about what we learn as we go.
A big part of the grant is using machine learning to automatically discover mentions of software use in the research literature. It’s going to be a really fun project because we’ll get to play around with some of the very latest in ML, which is currently The Hotness everywhere you look. And we’re learning a lot as we go. One of the first questions we’ve tackled (also in response to some good reviewer feedback) is: how big does our training set need to be? The machine learning system needs to be trained to recognize software mentions, and to do that we need to give it a set of annotated papers where we, as humans, have marked what a software mention looks like (and doesn’t look like). That training set is called the gold standard. It’s what the machine learning system learns from. Below is copied from one of our reviewer responses:
We came up with the number of articles to annotate through a combination of theory, experience, and intuition. As usual in machine learning tasks, we considered the following aspects of the task at hand:
- prevalence: the number of software mentions we expect in each article
- task complexity: how much do software-mention words look like other words we don’t want to detect?
- number of features: how many different clues will we give our algorithm to help it decide whether each word is a software mention (e.g. is it a noun, is it in the Acknowledgements section, is it a mix of uppercase and lowercase, etc.)
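To make the "features" idea concrete, here is a small sketch of a few such clues computed for a single token. The feature names and logic are our own illustrations, not the proposal's actual feature set.

```python
def token_features(token, section):
    """A few hand-crafted clues for one token, as a feature dict.
    Illustrative only; a real system would compute ~100 such features."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_mixed_case": (any(c.islower() for c in token)
                          and any(c.isupper() for c in token)
                          and not token.istitle()),
        "in_acknowledgements": section.lower() == "acknowledgements",
        "has_digit": any(c.isdigit() for c in token),
    }

# "SciPy" in a Methods section: capitalized, mixed-case, no digits.
print(token_features("SciPy", "Methods"))
```

A classifier then learns which combinations of these clues (mixed case, digits, section, and so on) are predictive of a token being a software mention.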
None of these aspects is clearly understood for this task at this point (one outcome of the proposed project is that we will understand them better once we are done, for future work), but we do have rough estimates. Software mention prevalence will differ by domain, but we expect roughly 3 mentions per paper, very roughly, based on previous work by Howison et al. and others. Our estimate is that the task is moderately complex, based on the moderate F-measures achieved by Pan et al. and Duck et al. with hand-crafted rules. Finally, we are planning to give our machine learning algorithm about 100 features (50 automatically discovered/generated by word2vec, plus 50 standard and rule-based features, as we discuss in the full proposal).
We then used these estimates. As is common in machine learning sample size estimation, we started by applying a rule-of-thumb for the number of articles we’d have to annotate if we were to use the most simple algorithm, a multiple linear regression. A standard rule of thumb (see https://en.wikiversity.org/wiki/Multiple_linear_regression#Sample_size) is 10-20 datapoints are needed for each feature used by the algorithm, which implies we’d need 100 features * 10 datapoints = 1000 datapoints. At 3 datapoints (software mentions) per article, this rule of thumb suggests we’d need 333 articles per domain.
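That back-of-the-envelope calculation, spelled out in code:

```python
features = 100               # 50 word2vec + 50 standard/rule-based features
datapoints_per_feature = 10  # low end of the 10-20 rule of thumb
mentions_per_article = 3     # rough prevalence estimate per paper

datapoints_needed = features * datapoints_per_feature   # 1000 datapoints
articles_needed = datapoints_needed / mentions_per_article
print(round(articles_needed))  # 333 articles per domain
```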
From there we modified our estimate based on our specific machine learning circumstance. Conditional Random Fields (our intended algorithm) is a more complex algorithm than multiple linear regression, which might suggest we’d need more than 333 articles. On the other hand, our algorithm will also use “negative” datapoints inherent in the article (all the words in the article that are *not* software mentions, annotated implicitly as not software mentions) to help learn information about what is predictive of being vs not being a software mention — the inclusion of this kind of data for this task means our estimate of 333 articles is probably conservative and safe.
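A small sketch of why those negatives come for free: in a token-labelled gold standard, every token the annotator did not mark is an implicit negative example. The sentence and mention span below are invented for illustration.

```python
# One annotated sentence: the human marked only the software mention,
# but every unmarked token becomes a negative training example.
tokens = ["We", "analysed", "the", "data", "with", "SPSS", "."]
mention_spans = [(5, 6)]  # token index range marked by the annotator

labels = ["O"] * len(tokens)          # "O" = not a software mention
for start, end in mention_spans:
    for i in range(start, end):
        labels[i] = "MENTION"

print(list(zip(tokens, labels)))
print(labels.count("O"))  # 6 implicit negatives from one positive annotation
```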
Based on this, as well as reviewing the literature for others who have done similar work (Pan et al. used a gold standard of 386 papers to learn their rules, Duck et al. used 1479 database and software mentions to train their rule weighting, etc.), we determined that 300-500 articles per domain was appropriate. We also plan to experiment with combining the domains into one general model — in this approach, the domain would be added as an additional feature, which may prove more powerful overall. This would bring all 1000-1500 articles into the training set.
Finally, before proposing 300-500 articles per domain, we did a gut-check whether the proposed annotation burden was a reasonable amount of work and cost for the value of the task, and we felt it was.
Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. (2016). A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLOS ONE, 11(6), e0157989. http://doi.org/10.1371/journal.pone.0157989
Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), Article first published online: 13 MAY 2015. http://doi.org/10.1002/asi.23538
Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. http://doi.org/10.1016/j.joi.2015.07.012
The post How big does our text-mining training set need to be? appeared first on Impactstory blog.