Early adopters of the OpenCitations Data Model

OpenCitations is very pleased to announce its collaboration with four new scholarly Research and Development projects that are early adopters of the recently updated OpenCitations Data Model, described in this blog post.

The four projects are similar in that each is independently using text mining and optical character recognition or PDF extraction techniques to extract citation information from the reference lists of published works, and each is making these citations available as Linked Open Data. Three of the four will also use the OpenCitations Corpus as a publication platform for their citation data. The academic disciplines from which these citation data are being extracted are the social sciences, the humanities and economics.

Summary: Citations play an important role in scientific discourse, in the practice of information retrieval, and in bibliometrics. Recently, there has been a growing number of initiatives that make citations freely available as open data. The article describes the current status of these initiatives and shows that a critical mass of data could be made available in the near future. New opportunities could arise from that, especially for libraries. The DFG-funded project Linked Open Citation Database (LOC-DB) is presented as a practical way for libraries to participate.

Clarivate Analytics’ citation data now integrated into DeepDyve’s online rental service for scientific and scholarly research

Clarivate Analytics and DeepDyve have announced that Clarivate Analytics’ citation data is now integrated into DeepDyve’s online rental service for scientific and scholarly research. Articles from the Web of Science that appear on DeepDyve will now include a Times Cited feature indicating how many times those articles have been cited. These real-time citation metrics, powered by Web of Science, enable users to quickly assess the authority and impact of a given article and focus their research efforts accordingly.

The Sloan Foundation funds OpenCitations

The OpenCitations Enhancement Project funded by Sloan

The Alfred P. Sloan Foundation, which funds research and education in science, technology, engineering, mathematics and economics, including a number of key technology projects relating to scholarly communication, has agreed to fund The OpenCitations Enhancement Project, a new project to develop and enhance the OpenCitations Corpus.

As readers of this blog will know, the OpenCitations Corpus is an open scholarly citation database that freely and legally makes available accurate citation data (academic references) to assist scholars with their academic studies, and to serve knowledge to the wider public.

Objectives

The OpenCitations Enhancement Project, funded by the Sloan Foundation for 18 months from May 2017, will make the OpenCitations Corpus (OCC) more useful to the academic community both by significantly expanding the volume of citation data held within the Corpus, and by developing novel data visualizations and query services over the stored data.

At OpenCitations, we will achieve these objectives in the following ways:

(a) By establishing a new powerful physical server to handle the Corpus data and offer adequate performance for query services.

(b) By increasing the rate of data ingest into the Corpus: thirty small data-ingest computers (Raspberry Pi 3Bs) will be integrated with this server and will work in parallel to harvest references, increasing the current rate of corpus data ingest some thirty-fold, to about half a million citation links per day.

(c) By employing a post-doctoral computer science research engineer specifically to develop information visualisation interfaces and sense-making tools that will provide smart ways of envisaging and comprehending the citation data stored within the OpenCitations Corpus, and will also ease the task of manually curating the OCC.

Personnel

This post-doctoral appointment will start in the autumn of 2017, once the new hardware has been commissioned and programmed. We seek a highly intelligent, skilled and motivated individual who is an expert in Web Interface Design and Information Visualization, and who can demonstrate a commitment to increasing the openness of scholarly information. A formal advertisement for this post, which will be held at the University of Bologna in Italy under the supervision of Dr Silvio Peroni, will be published in the near future. In the meantime, individuals with the relevant skills and background who would like to express early interest in joining the OpenCitations team in this role should contact him by e-mail to <silvio.peroni@opencitations.net>.

Expected Outcomes

By the end of the OpenCitations Enhancement Project, we will have harvested approximately 190 million citation links obtained from the reference lists of about 4.4 million scholarly articles (~15% of Web of Science’s coverage). In this way, in a significant initial step towards the comprehensive literature coverage we seek for the OCC, we will establish the OpenCitations Corpus as a valuable and persistent free-to-use global scholarly on-line Linked Open Data service.

In so doing, we aim to empower the global community by liberating scholarly citation data from their current commercial shackles, publishing such data with a Creative Commons CC0 Public Domain Dedication that will enable novel third-party services to be built over them.

PubMed: Redesigning citation data management

Over the last couple of years, we have drastically changed the systems and processes used to manage PubMed citation data. It began with revising long-standing NLM policies and reducing reliance on manual citation corrections, and culminated with the release of the PubMed Data Management (PMDM) system in October 2016. With PMDM, we introduced a single system for managing citation data, with a UI for editing citation data. In this brave new world, the responsibility for correcting citation data shifted from NLM Data Review to PubMed data providers. Any errors reported in PubMed citations are now forwarded to the publisher ― a strategy that publishers have enthusiastically upheld. Here, we outline how the systems and processes for managing PubMed citation data have changed, and detail the outcome of these changes since PMDM was launched.


Three publications describing the Open Citations Corpus

Last September, I attended the Fifth Annual Conference on Open Access Scholarly Publishing, held in Riga, at which I had been invited to give a paper entitled The Open Citations Corpus – freeing scholarly citation data.  A recording of my talk is available here, and my PowerPoint presentation is separately available here.  My own reflections on the major themes of the conference are given in a separate Semantic Publishing Blog post.

While in Riga preparing to give that talk about the importance of open citation data, I received an invitation from Sara Abdulla, Chief Commissioning Editor at Nature, to write a Comment piece for their forthcoming special issue on Impact.  My immediate reaction was that this should be on the same theme, an idea to which Sara readily agreed.  The deadline for delivery of the article was 10 days later!

As soon as the Riga conference was over, I first assembled all the material I had to hand that could be relevant to describing the Open Citations Corpus (OCC) in the context of conventional access to academic citation data from commercial sources.  That gave me a raw manuscript of some five thousand words, from which I had to distil an article of less than 1,300 words.  I then started editing, and asked my colleagues Silvio Peroni and Tanya Gray for their comments.

The end result, enriched by some imaginative art work by the Nature team, was published a couple of weeks later on 16th October [1], and presents both the intellectual argument for open citation data, and the practical obstacles to be overcome in achieving the goal of a substantial corpus of such data, as well as giving a general description of the Open Citations Corpus itself and of the development work we have planned for it.

Because of the drastic editing required to reduce the original draft to about a quarter of its size, all material not crucial to the central theme had to be cut.  I thus had the idea of developing the original draft subsequently into a full journal article that would include these additional themes, particularly Silvio’s work on the SPAR ontologies described in this Semantic Publishing Blog post [2], Tanya’s work on the CiTO Reference Annotation Tools described in this Semantic Publishing Blog post, and a wonderful analogy between the scholarly citation network and Venice devised by Silvio.  I also wanted to give authorship credit to Alex Dutton, who had undertaken almost all of the original software development work for the OCC.  For this reason, instead of assigning copyright to Nature for the Comment piece, I gave them a license to publish, retaining copyright to myself so I could re-use the text.  I am pleased to say that they accepted this without comment.

Silvio and I then set to work to develop the draft into a proper article.  The result was a ten-thousand word paper submitted to the Journal of Documentation a week before Christmas [3].  We await the referees’ comments!

 References

[1] Shotton D (2013). Open citations. Nature 502: 295–297. http://www.nature.com/news/publishing-open-citations-1.13937. doi:10.1038/502295a.

[2] Peroni S and Shotton D (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics: Science, Services and Agents on the World Wide Web 17: 33–43. doi:10.1016/j.websem.2012.08.001.

[3] Peroni S, Dutton A, Gray T and Shotton D (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation 71 (2): 253–277. doi:10.1108/JD-12-2013-0166. OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliographic-references.pdf

This is the main article about OpenCitations. It includes substantial background information, describes the main ideas and work supporting the whole project and the Corpus, and outlines some possible future developments in terms of new kinds of data to be included, e.g. citation functions.

 

Open Citations Corpus Import Process

As part of the Open Citations project, we have been asked to review and improve the process of importing data into the Open Citations Corpus, taking the scripts from the initial project as our starting point.

The current import procedure evolved from several disconnected processes and requires running multiple command line scripts and transforming the data into different intermediate formats. As a consequence, it is not very efficient and we will be looking to improve on the speed and reliability of the import procedure. Moreover, there are two distinct procedures depending on the source of the data (arXiv or PubMed Central); we are hoping to unify the common parts of these procedures into a single process which can be simplified and normalised to improve code re-use and comprehensibility.

The Workflow

As PubMed Central provides an OAI-PMH feed, this could be used to retrieve article metadata and, for some articles, full text. Using this feed, rather than an FTP download (as used currently), would allow the metadata import for both arXiv and PubMed Central to follow a near-identical process, as we are already using the OAI-PMH feed for arXiv.
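
To make the harvesting step concrete, here is a minimal sketch of pulling records over OAI-PMH in Python, following resumption tokens as the protocol requires. The endpoint URL and metadataPrefix in the usage comment are assumptions for illustration only; the real pipeline would use whatever values the PMC and arXiv feeds actually require.

```python
# Minimal sketch of harvesting article metadata over OAI-PMH.
import xml.etree.ElementTree as ET
import requests

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix, from_date=None):
    """Yield OAI-PMH <record> elements, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    while True:
        response = requests.get(base_url, params=params, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Example use (hypothetical endpoint and metadata prefix):
# for rec in harvest("https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi", "pmc"):
#     process(rec)
```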

Also, rather than have intermediate databases and information stores, it would be cleaner to import from the information source straight into a datastore. The datastore could then be queried, allowing matches and linking between articles to be performed in situ. The process would therefore become:

  1. Pull new metadata from arXiv (OAI-PMH) and PubMed Central (OAI-PMH) and insert new records into the Open Citations Corpus datastore
  2. Pull new full-text from arXiv and PubMed Central, extract citations, and match them against article records in the Open Citations Corpus datastore, creating links between these references and the metadata records for the cited articles. Store unmatched citations as nested records in the metadata for each article.
  3. On a scheduled basis (e.g. nightly), review each existing article’s unmatched citations and attempt to match these with existing bibliographic records of other articles (a rough sketch of this matching pass follows this list).
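
Step 3 could be approximated as follows. This is only a rough sketch: the datastore methods and citation field names (find_by_identifier, citations_unmatched, and so on) are hypothetical rather than part of the existing codebase.

```python
# Rough sketch of the nightly matching pass (step 3 above).
def match_unmatched_citations(datastore):
    """Try to link each article's unmatched citations to existing records."""
    for article in datastore.all_articles():
        still_unmatched = []
        for citation in article.get("citations_unmatched", []):
            # Prefer an identifier match (e.g. DOI), falling back to a
            # normalised title + first-author comparison.
            target = (datastore.find_by_identifier(citation.get("doi"))
                      or datastore.find_by_title_author(
                          citation.get("title"), citation.get("first_author")))
            if target is not None:
                article.setdefault("citations_matched", []).append(target["id"])
            else:
                still_unmatched.append(citation)
        article["citations_unmatched"] = still_unmatched
        datastore.save(article)
```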

In outline, this looks like this:

The Datastore

Neo4j is currently used by the Related Work system as the final Open Citations Corpus datastore for the arXiv data. We propose instead to use BibServer as the final datastore, for its flexibility, scalability and suitability for the Open Citations use cases.

The Data Structure

The data stored within BibServer as BibJSON will be a collection of linked bibliographic records describing articles. Associated with each record and stored as nested data will be a list of matched citations (i.e. those for which the Open Citations Corpus has a bibliographic record), a list of unmatched citations, and a list of authors.
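
For illustration, a single stored record might be shaped roughly like this (expressed here as a Python dict; the exact BibJSON field names used by the pipeline may well differ):

```python
# Illustrative shape of one stored record; field names are assumptions.
example_record = {
    "id": "occ:article:12345",             # hypothetical identifier
    "title": "An example article",
    "author": [{"name": "A. N. Author"}],
    "year": "2012",
    "identifier": [{"type": "doi", "id": "10.1234/example"}],
    # Citations the Corpus already has a bibliographic record for:
    "citations_matched": ["occ:article:67890"],
    # Citations we could not (yet) link to an existing record:
    "citations_unmatched": [
        {"title": "A cited work we have no record of",
         "first_author": "B. Other", "year": "2009"},
    ],
}
```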

Authors will not be stored as separate entities. De-coupling and de-duplicating authors and articles could form the basis of a future project, perhaps using author identifiers (such as ORCID, PubMed Author ID or arXiv Author ID) or email addresses, but this will not be considered further in this work package.

Overall Aim

The overall aim of this work is to provide a consistent, simple and re-usable import pipeline for data for the Open Citations Corpus. In the fullness of time we’d expect it to be possible to add new data sources with minimal additional complexity. By importing data into the datastore at as early a stage as possible in the pipeline, we can use common tools for extracting, matching and deduplicating citations; the work for each data source, then, is just to convert the source data format into BibJSON and store it in BibServer, as sketched below.
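
In other words, the intended shape of the pipeline is a shared ingest path with a per-source converter plugged in. The sketch below is only indicative, and the function and method names (fetch_raw_records, to_bibjson, datastore.upsert) are hypothetical:

```python
# Sketch of the intended separation of concerns: each source supplies only a
# converter to BibJSON; everything downstream is shared.
def ingest(source_name, fetch_raw_records, to_bibjson, datastore):
    """Common import path: fetch, convert to BibJSON, store, then match."""
    for raw in fetch_raw_records():
        record = to_bibjson(raw)             # the only source-specific step
        record["source"] = source_name
        datastore.upsert(record)
    match_unmatched_citations(datastore)     # shared matching pass (see the earlier sketch)
```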

Postscript

David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out.  The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.

Open Citations – Indexing PubMed Central OA data

As part of our work on the Open Citations extensions project, I have recently been doing one of my favourite things – namely indexing large quantities of data then exploring it.

On this project we are interested in the PubMed Central Open Access subset, and more specifically, we are interested in what we can do with the citation data contained within the records that are in that subset – because, as they are open access, that citation data is public and freely available.

We are building a pipeline that will enable us to easily import data from the PMC OA and from other sources such as arXiv, so that we can do great things with it like explore it in a facetview, manage and edit it in a bibserver, visualise it, and stick it in the rather cool related-work prototype software. We are building on the earlier work of both the original Open Citations project, and of the Open Bibliography projects.

Work done so far

We have spent a few weeks getting to understand the original project software and clarifying some of the goals the project should achieve; we have put together a design for a processing pipeline to get the data from source right through to where we need it, in the shape that we need it. In the case of facetview / bibserver work, this means getting it into a wonderful elasticsearch index.

While Martyn continues work on the bits and pieces for managing the pipeline as a whole and pulling data from arXiv, I have built an automated and threadable toolchain for unpacking data out of the compressed file format it arrives in from the US National Institutes of Health, parsing the XML file format and converting it into BibJSON, and then bulk loading it into an elasticsearch index. This has gone quite well.
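
In condensed form, that toolchain looks roughly like the sketch below. The archive layout, the XML path used for the title, and the index name are illustrative assumptions, and the Elasticsearch bulk helper stands in for whatever loading code the real pipeline contains.

```python
# Condensed sketch of the toolchain: unpack a PMC OA archive, parse each
# article's XML, convert it to a BibJSON-ish dict and bulk-index it.
import tarfile
import xml.etree.ElementTree as ET
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def records_from_archive(path):
    """Yield minimal BibJSON-ish dicts for every article XML file in one .tar.gz."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".xml"):
                continue
            tree = ET.parse(archive.extractfile(member))
            title = tree.findtext(".//article-title") or ""
            yield {"title": title.strip(), "source_file": member.name}

def index_archive(path, es_hosts=("http://localhost:9200",)):
    """Bulk-load one archive into an Elasticsearch index (name assumed here)."""
    es = Elasticsearch(list(es_hosts))
    actions = ({"_index": "occ", "_source": rec}
               for rec in records_from_archive(path))
    bulk(es, actions)

# index_archive("articles.A-B.tar.gz")   # hypothetical archive file name
```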

To fully browse what we have so far, check out http://occ.cottagelabs.com.

For the code: https://github.com/opencitations/OpenCitationsCorpus/tree/master/pipeline.

The indexing process

Whilst the toolchain is capable of running threaded, the server we are using only has 2 cores and I was not sure to what extent they would be utilised, so I ran the process single-threaded. It took five hours and ten minutes to build an index of the PMC OA subset, and we now have over 500,000 records. We can full-text search them and facet-browse them.

A couple of things of particular interest that I learnt: I have an article in the PMC OA! Also, PMIDs are not always 8 digits long – they appear in fact to be assigned incrementally from 1.

What next

At the moment no effort is made to create record objects for the citations we find within these records; however, plugging that into the toolchain is now relatively straightforward.

The full pipeline is of course still in progress, and so this work will need a wee bit of wiring into it.

Improve parsing. There are probably improvements we can make to the parsing, so one of the next tasks will be to look at a few choice records and decide how to parse them better. The best way to get a look at the records for now is to use a browser such as Firefox or Chrome with the JSONview plugin installed, go to occ.cottagelabs.com, do a bit of searching, then click the small blue arrows at the start of a record you are interested in to see it in full JSON straight from the index. Some further analysis of a few of these records would be a great next step, and should allow improvements both to the data we can parse and to our representation of it.

Finish visualisations. Now that we have a good test dataset to work with, the various bits and pieces of visualisation work will be pulled together and put up on display somewhere soon. These, in addition to the search functionality already available, will enable us to answer the questions set as representative of project goals earlier in January (thanks David for those).

Postscript

David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out.  The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.
