Early adopters of the OpenCitations Data Model

OpenCitations is very pleased to announce its collaboration with four new scholarly Research and Development projects that are early adopters of the recently updated OpenCitations Data Model, described in this blog post.

The four projects are similar in that each is independently using text mining and optical character recognition or PDF extraction techniques to extract citation information from the reference lists of published works, and each is making these citations available as Linked Open Data. Three of the four will also use the OpenCitations Corpus as the publication platform for their citation data. The academic disciplines from which these citation data are being extracted are the social sciences, the humanities and economics.

Citations as First-Class Data Entities: The OpenCitations Data Model

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The second of these requirements is that they must have metadata structured using a generic yet appropriately detailed data model.

To fulfil that requirement, OpenCitations is pleased to announce the publication on 13 February 2018 of the OpenCitations Data Model, v1.6 [1].  This replaces the previous version, v1.5.3, published on 13 July 2016.

Citations as First-Class Data Entities: Introduction

Citations are now centre stage

As a result of the Initiative for Open Citations (I4OC), launched on April 6 last year, almost all the major scholarly publishers now open the reference lists they submit to Crossref, resulting in more than half a billion references being openly available via the Crossref API.

It is therefore time to think carefully about how citations are treated, and how they might be better handled as part of the Linked Open Data Web.

OpenCitations and the Initiative for Open Citations: A Clarification

Some folk are confused, but OpenCitations and the Initiative for Open Citations, despite the similarity of their names, are two distinct organizations.

OpenCitations (http://opencitations.net) is an open scholarly infrastructure organization directed by Silvio Peroni and myself, and its primary purpose is to host and build the OpenCitations Corpus (OCC), an RDF database of scholarly citation data that now contains almost 13 million citation links.

The Sloan Foundation funds OpenCitations

The OpenCitations Enhancement Project funded by Sloan

The Alfred P. Sloan Foundation, which funds research and education in science, technology, engineering, mathematics and economics, including a number of key technology projects relating to scholarly communication, has agreed to fund The OpenCitations Enhancement Project, a new project to develop and enhance the OpenCitations Corpus.

As readers of this blog will know, the OpenCitations Corpus is an open scholarly citation database that freely and legally makes available accurate citation data (academic references) to assist scholars with their academic studies, and to serve knowledge to the wider public.


The OpenCitations Enhancement Project, funded by the Sloan Foundation for 18 months from May 2017, will make the OpenCitations Corpus (OCC) more useful to the academic community both by significantly expanding the volume of citation data held within the Corpus, and by developing novel data visualizations and query services over the stored data.

At OpenCitations, we will achieve these objectives in the following ways:

(a) By establishing a powerful new physical server to handle the Corpus data and offer adequate performance for query services.

(b) By increasing the rate of data ingest into the Corpus, integrating with the server 30 small data-ingest computers (Raspberry Pi 3Bs) working in parallel to harvest references, thus increasing the current rate of data ingest some thirty-fold, to about half a million citation links per day.

(c) By employing a post-doctoral computer science research engineer specifically to develop information visualisation interfaces and sense-making tools that will both provide smart ways of visualising and comprehending the citation data stored within the OpenCitations Corpus, and ease the task of manually curating the OCC.


This post-doctoral appointment will start in the autumn of 2017, once the new hardware has been commissioned and programmed. We seek a highly intelligent, skilled and motivated individual who is an expert in Web Interface Design and Information Visualization, and who can demonstrate a commitment to increasing the openness of scholarly information. A formal advertisement for this post, which will be held at the University of Bologna in Italy under the supervision of Dr Silvio Peroni, will be published in the near future. In the meantime, individuals with the relevant skills and background who would like to express early interest in joining the OpenCitations team in this role should contact him by e-mail to <silvio.peroni@opencitations.net>.

Expected Outcomes

By the end of the OpenCitations Enhancement Project, we will have harvested approximately 190 million citation links obtained from the reference lists of about 4.4 million scholarly articles (~15% of Web of Science’s coverage). In this way, in a significant initial step towards the comprehensive literature coverage we seek for the OCC, we will establish the OpenCitations Corpus as a valuable and persistent free-to-use global scholarly on-line Linked Open Data service.

In so doing, we aim to empower the global community by liberating scholarly citation data from their current commercial shackles, publishing such data with a Creative Commons CC0 Public Domain Dedication that will enable novel third-party services to be built over them.

ResearchGate and Microsoft Academic Search (beta) – new rising citation indexes?

As every librarian knows, there are three main sources of citation data; in increasing order of size, they are Web of Science, Scopus and Google Scholar.

However, they are not the only sources. Recently, I noticed studies showing that two other sources, ResearchGate and Microsoft Academic Search, are getting large enough to be worth considering.
Could they possibly complement Google Scholar to serve as alternatives to the paid indexes?


While Mendeley offers “readers” as a statistic (basically the number of people who have a paper in their reference library), its citation data come directly from Scopus.
Mendeley’s citations are supplied by Scopus
In contrast, ResearchGate, perhaps the biggest social networking site for academics, provides its own citation metrics.
But how good is this citation index? A recent study in Scientometrics (OA version), covering 86 Information Science and Library Science journals from January 2016 to March 2017, found that while ResearchGate found fewer citations than Google Scholar, it generally found more citations than the paid indexes, Web of Science and Scopus.
Thelwall, M., & Kousha, K. (2017). ResearchGate versus Google Scholar: Which finds more early citations?. Scientometrics, 1-7.
Another interesting finding is that the correlation between Google Scholar and ResearchGate citation counts is actually higher than that between Web of Science and Scopus.
These results were quite surprising to me. While I was aware that ResearchGate was pretty much the largest source of free full-text articles (not counting piracy sites), often appearing as a source of free articles in Google Scholar, and was in addition the biggest social networking site for academics, it seemed unlikely to me that any citation index based on material uploaded to ResearchGate would be anywhere near complete. The key seems to be that while many of the papers uploaded to ResearchGate currently infringe copyright, publishers have mostly not issued take-down notices for them (at least not recently).


Honestly, I’m not sure why the publishers are not currently acting (though they did act against academia.edu a few years ago), but the instability of this situation is perhaps the greatest sticking point in using ResearchGate as a citation index.

But is there anything better?

Microsoft Academic Search (new version)

Microsoft’s alternative to Google Scholar, Microsoft Academic Search, now in its second iteration, may fit the bill.

Some of you might remember the earlier version, which seemed to have been abandoned in 2014, with analysis revealing that Microsoft appeared to have stopped indexing new articles.

Fortunately, Microsoft has since clarified that they were actually working on a new version of the service.

They then launched a preview version of this new Microsoft Academic service about a year ago.


As you will see later, many bibliometrics experts are very excited by this new service. But why?

Microsoft describes it thus in a 2015 paper: “At the core of MAS is a heterogeneous entity graph comprised of six types of entities that model the scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, we in this version include data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine.”

In short, it seems to offer the best of both worlds, using Google Scholar-style crawling technology combined with publisher metadata feeds to build a large index, with an attention to metadata fields similar to what you get from library-type databases. On top of that, unlike Google Scholar, from which extracting data is notoriously difficult, you can extract data easily from Microsoft Academic Search via an API, or by downloading the Microsoft Academic Graph (MAG).
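
For instance, here is a minimal Python sketch of querying the Academic Knowledge API's evaluate method. The endpoint URL, the expression syntax and the attribute codes (Ti = title, Y = year, CC = citation count) are taken from my reading of Microsoft's documentation and may change; the subscription key is a placeholder.

import requests

API_KEY = "YOUR_SUBSCRIPTION_KEY"  # placeholder: a free key is obtainable from Microsoft
ENDPOINT = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"  # assumed URL

params = {
    "expr": "Ti='web scale discovery'",  # expression matching the normalized title
    "attributes": "Ti,Y,CC",             # title, year, citation count
    "count": 10,
}
resp = requests.get(ENDPOINT, params=params,
                    headers={"Ocp-Apim-Subscription-Key": API_KEY})
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity.get("Ti"), entity.get("Y"), entity.get("CC"))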

I’ve been playing around with the service, setting up my own profile, kicking the tires etc.

My profile in the new Microsoft academic search

I’m not ready to do a full review yet, but it does look promising despite the bugs.

Some preliminary things I noticed

Search-wise, the number of results generated seems closer to what you would see in library databases. For example, a search for the terms web scale discovery gets you around 111 results, but 2.6 million in Google Scholar! It seems unlikely that the Microsoft Academic Search index is smaller by such a degree (see later), so this is probably because it does not search and match within the full text (for whatever technical reason).

And this was confirmed via Twitter.

The other major difference from Google Scholar is that Google Scholar shows [citation] results – items that it has not itself indexed – while Microsoft Academic Search does not.

[Citation] results in Google Scholar

All this perhaps explains why you get fewer results compared to Google Scholar.

The size of Microsoft Academic Search versus the rest

Microsoft claimed an index of 83 million publication records in 2015; by 2016, this had risen to 140 million publication records, 40 million authors and 60,000 journal titles. As estimates for the size of Google Scholar’s index typically fall into the 100+ million range (it is notoriously hard to get any hard facts about the size of Google Scholar), Microsoft is now seemingly within the same ballpark, and is significantly bigger than Scopus and Web of Science, each of which is perhaps 60%-70% of its size.

But that is what is claimed; what does the research by Harzing and other researchers show?

Harzing, of course, is well known as the author of the free “Publish or Perish” tool, the only tool allowed by Google to extract citation data from Google Scholar. She has now added support for Microsoft Academic Search to version 5.0 of the tool.

In blog posts such as “Microsoft Academic (Search): a Phoenix arisen from the ashes?” and the peer-reviewed article “Microsoft Academic: Is the Phoenix getting wings?” (http://www.harzing.com/download/mas2.pdf; publisher’s version), she shows that the coverage of Microsoft Academic Search equals or exceeds that of the paid indexes Scopus and Web of Science. While it still falls behind that of Google Scholar, the gap is closing.



For example, her blog post studying the coverage of her own works finds that practically all of her publications indexed in Scopus (all but 2) and Web of Science (all but 1) are also in Microsoft Academic Search. On top of that, Microsoft Academic Search finds 30 and 43 more of her works than Scopus and Web of Science respectively. Google Scholar still dominates Microsoft Academic Search, though.

In her more comprehensive study of 145 Associate Professors and Full Professors at the University of Melbourne, she studies citation counts to the works of these authors.

In general, Scopus and Web of Science detect slightly more citations than Microsoft Academic Search in the Life Sciences (11% more) and the Sciences (7% more), and pretty much tie for Engineering, but Microsoft Academic Search beats the other two handily in the Humanities (170% of Scopus) and the Social Sciences (145% of Scopus). Google Scholar clearly dominates all, as usual.

Harzing goes on to explain that Microsoft Academic Search uses machine learning to drop citations that it cannot verify are true cites, and attempts to correct for this by estimating “true” citation counts. This leads to the following comparison.

When looking at this estimated true citation count (MA ECC), Microsoft Academic Search actually finds more citations than Google Scholar in the Life Sciences, and just barely loses out in the Sciences and Engineering.
But Google Scholar continues to dominate in the Social Sciences and particularly the Humanities. This is probably due to the impact of Google Books for book-related items.

I could go on to describe the results from the thesis “Comparison of Microsoft Academic (Graph) with Web of Science, Scopus and Google Scholar” or the paper “The coverage of Microsoft Academic: Analyzing the publication output of a university”, but the results are much the same, with the latter describing the service as being “on the verge of becoming a bibliometric superpower”.

Completeness of metadata fields

But I suspect the interest in Microsoft Academic Search is not based purely on the size of its index. After all, Google Scholar still seems to have the edge in size.

Rather, the interest lies in the fact that the service is now sufficiently big, combined with the richness of its metadata and the ease with which its data can be extracted – both areas in which Google Scholar is extremely poor. The only tool officially licensed by Google, Publish or Perish, is often unreliable and cannot be used for large-scale extraction, for example.

It might be worth reading the thesis “Comparison of Microsoft Academic (Graph) with Web of Science, Scopus and Google Scholar”, particularly chapter 4, which compares the openness of access to the data of the three sources and the completeness of their metadata fields.

Similarly, “Citation Analysis with Microsoft Academic” assesses in depth the suitability of Microsoft Academic Search as a bibliometric tool, in terms of the completeness of its metadata fields and the ease with which its data can be accessed for extraction.

In general, the results are positive: it is far easier to extract and manipulate data than with Google Scholar, whether through the API or by downloading the Microsoft Academic Graph, both of which offer much richer and more structured data than Google Scholar. Even something like having internal Microsoft Academic Search identifiers for “papers, references, authors, affiliations, fields of study, journals and venues” is very helpful.

It is not a perfect tool, though. For example, examining the available attributes, the authors note that there is no document type (which makes metrics that normalize by document type hard to compute), nor the very obvious DOI attribute (a strange omission). While there is a subject attribute, “field of study”, it is dynamically generated and far too specific (50,000 fields of study?).


Of course, for ordinary users who just want to calculate their citation counts, you can use Harzing’s Publish or Perish v5 and above, with a free API key from Microsoft, to mine Microsoft Academic Search.



Currently our citation sources consist of either paid services like Scopus and Web of Science, or free-to-access services run by commercial companies – Google Scholar and Microsoft Academic Search. Neither situation is ideal.

The OpenCitations project is probably the best solution, but as of writing there is no study I know of quantifying the size of its index.

Still, one wonders if this might be the beginning of the end for paid citation indexes. Use of Scopus and Web of Science as discovery tools has greatly declined in recent years, and much of their value now lies in generating citation metrics.

As open access continues to march on, more and more content will be freely available. This will free up citations/references to be mined as well (albeit not always in structured form), so citation indexes will have to compete on data quality, feature sets and ease of use.

Players with strengths in handling and cleaning large datasets (e.g. Google) will of course have a big edge here. Traditional companies that serve libraries and academia may not be able to match this, but they do have the strength of a better understanding of academics, so it is going to be interesting to watch.

Acknowledgements: As always, hat-tip to the very informative Google Scholar Digest Twitter account for alerting me to these studies.

Querying the OpenCitations Corpus

OpenCitations makes available a SPARQL endpoint for querying the data included in the OpenCitations Corpus. While several queries are possible according to the model described on the website (and, in more detail, in the official metadata document of the Corpus), we have received requests from users of the service for exemplar queries. We have chosen two of them, which are particularly relevant to the work that has been done in the past months by the Initiative for Open Citations – which we have already introduced in another blog post.

Query: return all the papers (including their titles) citing the article with DOI “10.1038/227680a0”.

PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
SELECT ?citing ?title WHERE {
  ?id a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:doi ;
    literal:hasLiteralValue "10.1038/227680a0" .
  ?cited datacite:hasIdentifier ?id ;
    ^cito:cites ?citing .
  ?citing dcterms:title ?title
}

Query: return all the papers cited by the bibliographic resource “br/4186” included in the OCC, including the text of bibliographic references used in “br/4186” for making the citations and the titles of the cited papers.

PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX biro: <http://purl.org/spar/biro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX c4o: <http://purl.org/spar/c4o/>
SELECT ?cited ?cited_ref ?title WHERE {
  <https://w3id.org/oc/corpus/br/4186> cito:cites ?cited ;
    frbr:part ?ref .
  ?ref biro:references ?cited ;
    c4o:hasContent ?cited_ref .
  OPTIONAL { ?cited dcterms:title ?title }
}
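
These queries can also be run programmatically. Here is a minimal Python sketch using the SPARQLWrapper library and the first query above; the endpoint address is an assumption based on the w3id.org/oc namespace used by the Corpus.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://w3id.org/oc/sparql")  # assumed OCC endpoint URL
sparql.setQuery("""
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
SELECT ?citing ?title WHERE {
  ?id a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:doi ;
    literal:hasLiteralValue "10.1038/227680a0" .
  ?cited datacite:hasIdentifier ?id ;
    ^cito:cites ?citing .
  ?citing dcterms:title ?title
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["citing"]["value"], "-", row["title"]["value"])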

The Initiative for Open Citations

OpenCitations are pleased to announce the launch of the Initiative for Open Citations (I4OC), a fresh momentum in the scholarly publishing world to open up data on the citations that link research publications.  OpenCitations are proud to be a founder of I4OC, and we encourage those remaining publishers whose journal article reference lists are still closed to embrace this sea change in attitude towards open citation data. The other I4OC founding organizations are the Wikimedia Foundation, PLOS, eLife, DataCite, and the Centre for Culture and Technology at Curtin University.

Until recently, the vast majority of citation data were not openly available, even though all major publishers freely share their article metadata through Crossref. Before I4OC started, only about 1% of the reference data deposited in Crossref were freely available. Today that figure has jumped to 40% [1].


In recent months, following earlier indications of willingness reported in this blog, several publishers have made the decision to release these metadata publicly, including the American Geophysical Union, Association for Computing Machinery, BMJ, Cambridge University Press, Cold Spring Harbor Laboratory Press, EMBO Press, Royal Society of Chemistry, SAGE Publishing, Springer Nature, Taylor & Francis, and Wiley. They join other publishers who have been opening their references through Crossref for some time. The full list of scholarly publishers now opening their reference data via Crossref is given in [2].

These decisions stem from discussions that have been taking place since a call-to-action to open up citations was made by Dario Taraborelli of the Wikimedia Foundation at the 2016 OASPA Conference on Open-Access Publishing. The creation of I4OC was spearheaded by Jonathan Dugan, Martin Fenner, Jan Gerlach, Catriona MacCallum, Daniel Mietchen, Cameron Neylon, Mark Patterson, Michelle Paulson, Silvio Peroni, myself and Dario Taraborelli. The purpose of I4OC is to coordinate these efforts and to promote the creation of a comprehensive, freely-available corpus of scholarly citation data.


Such a corpus will be valuable for new as well as existing services, and will allow many more interested parties to explore, mine, and reuse the data for new knowledge. The key benefits that arise from a fully open citation dataset include:

  1. The establishment of a global public web of linked scholarly citation data to enhance the discoverability of published content, both subscription access and open access. This will particularly benefit individuals who are not members of academic institutions with subscriptions to commercial citation databases.
  2. The ability to build new services over the open citation data, for the benefit of publishers, researchers, funding agencies, academic institutions and the general public, as well as to enhance existing services.
  3. The creation of a public citation graph to explore connections between knowledge fields, and to follow the evolution of ideas and scholarly disciplines.


The Internet Archive, Mozilla, the Wellcome Trust, and twenty-eight other projects and organizations have formally put their names behind I4OC as stakeholders in support of openly accessible citations. The full list of stakeholders is given in [3].


Dario Taraborelli, Head of Research at the Wikimedia Foundation, said:

“Citations are the foundation for how we know what we know. Today, tens of millions of scholarly citations become available to the public with no copyright restriction. We look forward to more organizations joining this initiative to release, and build on these data.”

Liz Ferguson, VP Publishing Development, Wiley, said:

“Wiley is delighted to support I4OC by opening our citation metadata via Crossref. Collaborating with other publishers further contributes to sustainable and standardized infrastructure that will benefit the research community. We are particularly excited by the potential to expose networks of research that would otherwise lie hidden or take years to discover.”

Robert Kiley, Head of Open Research at the Wellcome Trust, said:

“The open availability of citation data will help all funders better evaluate the research they fund. The progress that I4OC has made is an essential first step and we encourage all publishers to publicly share this data.”

Mark Patterson, Executive Director of eLife, said:

“It’s fantastic to see the interest that’s being shown by so many publishers in making their reference list metadata publicly available. We hope that this new momentum will encourage all publishers to follow suit, and that new services and tools can be built around these open data.”

Catriona MacCallum, Advocacy Director, PLOS, said:

“Creating an open database of citations will allow researchers to perform independent analyses of how scientific ideas are communicated through article citations, and a transparent way of tracking the influence of particular articles. By opening up these metadata via Crossref, publishers are providing a vital contribution to Open Science.”

Future growth

Many other publishers have expressed interest in opening up their reference data. They can do this very easily via Crossref, with a simple email to support@crossref.org requesting they turn on reference distribution for all their DOI prefixes. This is required even for publishers of open access articles, since by default references submitted to the Crossref Cited-By Linking service are closed, as previously explained here.  I4OC will provide regular updates on the growth of the public citation corpus, how the data are being used, additional stakeholders and participating publishers as they join, and as new services are developed.

I4OC and OpenCitations

Through the efforts of I4OC, scholarly citation data will be increasingly available to any interested party through all of Crossref’s Metadata Delivery Services, including the REST API and bulk metadata dumps. From this open source, OpenCitations will progressively import the citation data into the OpenCitations Corpus, describe them using the SPAR Ontologies according to the OCC metadata model, and make them available in RDF under a Creative Commons public domain dedication as Linked Open Data.  Potential users should be aware that it will take considerable time before all the new citation data now available via the Crossref API are ingested into the OpenCitations Corpus.
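
As an illustration of what the Crossref REST API already makes possible, here is a minimal Python sketch that retrieves the deposited reference list for a single DOI; the example DOI is arbitrary, and the "reference" field is present only where the publisher has deposited references and opened their distribution.

import requests

doi = "10.1038/502295a"  # an arbitrary example DOI
work = requests.get("https://api.crossref.org/works/" + doi).json()["message"]

# each open reference carries a DOI and/or an unstructured citation string
for ref in work.get("reference", []):
    print(ref.get("DOI") or ref.get("unstructured"))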

I4OC links


[1] 40% is the percentage of publications with open references out of the total number of publications with reference metadata deposited with Crossref. As of March 2017, nearly 35 million articles with references are deposited with Crossref.

[2] Full list of publishers now making their citation data open via Crossref.

[3] Full list of I4OC supporting stakeholder organizations.

Open Citations is dead. Long live OpenCitations.


In October 2015, I asked Silvio Peroni, my long-term colleague in the development of the SPAR Ontologies, to become Co-Director of the Open Citations Project, and to work with me in taking forward the prototype Open Citations Corpus (OCC), originally developed at the University of Oxford with the support of Jisc, with the aim of developing it into a production service of real use to scholars.

The result is OpenCitations, a new instantiation of the OCC hosted by the Department of Computer Science and Engineering of the University of Bologna, based on a new metadata schema and employing several new technologies to automate the ingestion of fresh citation metadata from authoritative sources.

Since the beginning of July 2016, OpenCitations has been ingesting and processing accurate bibliographic references harvested from the reference lists of scholarly papers available in Europe PubMed Central, enriched by metadata from Crossref. These scholarly citation data are described using the SPAR Ontologies according to the new OpenCitations metadata document [1], and are published under a Creative Commons public domain dedication (CC0), so that others may freely build upon, enhance and reuse them for any purpose, without restriction under copyright or database law. We have described the new OpenCitations Corpus, and the new software developed by Silvio to create it, in [2].

OpenCitations is being continuously populated from the scholarly literature, and, as of 30th March 2017, has ingested the references from 123,989 citing bibliographic resources, and contains information about 5,307,857 citation links to 3,469,648 cited resources.

The whole OCC is now available for querying (via SPARQL), and for browsing by means of a very simple Web interface that shows only the data about bibliographic entities (e.g. https://w3id.org/oc/corpus/br/1). Additional, more user-friendly interfaces will become available in the coming months. The entire contents of the OpenCitations Corpus (OCC) are also archived every month as data dumps that are made available online through Figshare. Each dump comprises several zip archives, each containing either data or provenance information for a particular sub-dataset of the OCC.

Despite the fact that OpenCitations presently contains only a small proportion of global citation data, it is important to realize that, because of the very nature of scholarly citation, even this partial coverage includes citations of the most important papers in every biomedical field, such critical papers being characterized by their high numbers of inward citation links.

[1] Silvio Peroni, David Shotton (2016). Metadata for the OpenCitations Corpus. figshare. https://dx.doi.org/10.6084/m9.figshare.3443876

[2] Silvio Peroni, David Shotton, Fabio Vitali (2016). Freedom for bibliographic references: OpenCitations arise. Proceedings of 2016 International Workshop on Linked Data for Information Extraction (LD4IE 2016): 32-43.

Three publications describing the Open Citations Corpus

Last September, I attended the Fifth Annual Conference on Open Access Scholarly Publishing, held in Riga, at which I had been invited to give a paper entitled The Open Citations Corpus – freeing scholarly citation data.  A recording of my talk is available here, and my PowerPoint presentation is separately available here.  My own reflections on the major themes of the conference are given in a separate Semantic Publishing Blog post.

While in Riga preparing to give that talk about the importance of open citation data, I received an invitation from Sara Abdulla, Chief Commissioning Editor at Nature, to write a Comment piece for their forthcoming special issue on Impact.  My immediate reaction was that this should be on the same theme, an idea to which Sara readily agreed.  The deadline for delivery of the article was 10 days later!

As soon as the Riga conference was over, I first assembled all the material I had to hand that could be relevant to describing the Open Citations Corpus (OCC) in the context of conventional access to academic citation data from commercial sources.  That gave me a raw manuscript of some five thousand words, from which I had to distil an article of less than 1,300 words.  I then started editing, and asked my colleagues Silvio Peroni and Tanya Gray for their comments.

The end result, enriched by some imaginative art work by the Nature team, was published a couple of weeks later on 16th October [1], and presents both the intellectual argument for open citation data, and the practical obstacles to be overcome in achieving the goal of a substantial corpus of such data, as well as giving a general description of the Open Citations Corpus itself and of the development work we have planned for it.

Because of the drastic editing required to reduce the original draft to about a quarter of its size, all material not crucial to the central theme had to be cut.  I thus had the idea of developing the original draft subsequently into a full journal article that would include these additional themes, particularly Silvio’s work on the SPAR ontologies described in this Semantic Publishing Blog post [2], Tanya’s work on the CiTO Reference Annotation Tools described in this Semantic Publishing Blog post, and a wonderful analogy between the scholarly citation network and Venice devised by Silvio.  I also wanted to give authorship credit to Alex Dutton, who had undertaken almost all of the original software development work for the OCC.  For this reason, instead of assigning copyright to Nature for the Comment piece, I gave them a license to publish, retaining copyright to myself so I could re-use the text.  I am pleased to say that they accepted this without comment.

Silvio and I then set to work to develop the draft into a proper article.  The result was a ten-thousand word paper submitted to the Journal of Documentation a week before Christmas [3].  We await the referees’ comments!


[1]     Shotton D. (2013).  Open citations.  Nature 502: 295–297. http://www.nature.com/news/publishing-open-citations-1.13937. doi:10.1038/502295a.

[2]     Peroni S and Shotton D (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics: Science, Services and Agents on the World Wide Web. 17: 33-34. doi:10.1016/j.websem.2012.08.001.

[3]    Silvio Peroni, Alexander Dutton, Tanya Gray, David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253-277. http://dx.doi.org/10.1108/JD-12-2013-0166; OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliographic-references.pdf

This is the main article about OpenCitations. It includes substantial background information, presents the main ideas and work supporting the whole project and the Corpus, and outlines some possible future developments in terms of new kinds of data to be included, e.g. citation functions.


Open Citations – Doing some graph visualisations

Ongoing work on the Open Citations extensions project is now reaching the point of visualising – at very much a prototype level at this stage – the outputs of our earlier efforts to import and index the PubMed Central Open Access subset and arXiv.

Earlier in this project I asked David to specify a list of questions that he thought researchers might hope to answer by querying our Open Citations Corpus; the aim was to use these questions to guide our development, in the hope of providing a striking interface that also did something useful – there are too many visualisations of data that look very pretty but do not actually add much to the data. So, considering that list of questions and how one might visualise the data to an end that is not only pretty but also functional, I set myself the following problem:

Identify what it is in a dataset that is not easy to find in a textual representation, and make it useful for search

Based on our earlier text-search demonstrator, the answer pretty soon became – of course – interactions; whilst the properties of a result object are obvious in a textual result set, the interactions between those objects are not – and sometimes it is the interactions that one wishes to use as search parameters.

What I did

Having found my purpose, I set about applying the superb D3.js library to the problem, using it to draw SVG representations of elasticsearch query results directly in the browser. After testing a number of different result layouts, I settled upon a zoomable, pannable, force-directed network graph, and combined it with some code from my PhD work to build in some connections on the fly. This is, as mentioned earlier, still a work in progress, but the results so far are pretty good.

Take the image above, for example: this is a static representation of the interactions between David Shotton and all other authors (purple) with whom he has published an article (green) in the PMC OA subset. The red dots are the journals these articles appear in, and the brown dots are citations. As a static image this could be fairly informative when marked up with appropriate metadata, and it does look quite nice; but, more than that, it can act as part of a search interface to enable a much improved search experience.

So far, the production of a given image also reduces the result-set size; so whilst viewing the above image, the available suggestion dropdowns are automatically restricted to the subset of values relevant to the currently displayed image – dropdown suggestions are listed in order of popularity count, then upon typing one letter they switch to alphabetical order, and with multiple letters they become term searches. By typing in free-text search values or choosing suggestions, this visual representation of the current subset of results, combined with the automated restriction of further suggestions, should offer a simple yet powerful search experience. It is also possible to switch back to “list” view at any time, to see the current result set in a more traditional form. Further work – described below – will bring enhancements that add functionality to the elements of the visualisation too.

Try it

As with the search-result list demonstrator, it is possible to embed the visual search tool in any web page. However, as it looks better with full-screen real estate, I have saved that particular trick for the time being and simply made it available at http://occ.cottagelabs.com/graphview.

Now, before you rush off to try it, given its prototype state you will need some pointers. Taking the above image as an example once more, do the following to reproduce it:

  • Use a modern browser – Chrome renders javascript the fastest – on a reasonably decent machine to access http://occ.cottagelabs.com/graphview – a large screen resolution would be particularly nice
  • Choose authors as a search suggestion type
  • Start typing Shotton – click on Shotton David when it appears in the list
  • (If the error where it appears to return all results again occurs – described below – just keep going)
  • Tick the various display options to add author, journal and citation objects to the display

The next step would be to click an author or other entity bubble, then choose to add it to the search terms, or to start a new search based on that bubble or perhaps a subset of the returned bubbles; however, this is all still in development.

For a more complex example, try choosing keywords, then type Malaria. Once displayed, increase the result-set size to 400 so they are all displayed. Then try selecting the various authors, journals and citations tickboxes to add those objects; try increasing the sizes to see how many you can get before your computer melts… On my laptop, asking for more than about 1000 of each object results in poor performance. But here is an example of the output – all 383 articles with the Malaria keyword in the PMC OA, showing all 70 journals in which they are published, with links to the top 100 authors and citations. Which journal do you think is the large purple dot in the middle?

Outstanding issues


  • Numerous buttons have no action yet – clear / help / prev / next / + search / labels. Once these and other search action buttons are added, the visualisation can become a true part of the search experience rather than just a pretty picture
  • Searches are sent asynchronously and occasionally overlap, resulting in large query result sets overwriting smaller ones. This needs a delay added to user interactions.
  • Some objects should become one – for example, some citations are to the same article via both DOI and PMID, and some citations are also open access articles in our index, so they should be linked up as such.
  • There is as yet no visual cue that results are still loading, so the interface feels a bit in limbo. Easy fix.
  • Some of the client-side processing can be shifted to the backend (already in progress)
  • The date slider at the bottom is twitchy and needs smoother implementation and better underlying data (see below)

Data quality

Apart from the above technical tasks, we will need to revisit our data pipeline in order to answer more of the questions set by David. For example, we have very little affiliation data at present, and we are also missing a large amount of date information. Some data cleaning is also necessary – for example, keywords should all be lowercased to ensure we do not have subsets caused solely by capitalisation. There are also certain types of data that we know nothing about as yet – for example, author location, h-index and ORCID. However, this is all to be expected at this stage, and overall the ease with which we can spot these issues shows great progress.

More to come

There is still work to be done on this graph interface, and in addition we have some more demonstrators on the way. In combination with the work on improving the pipeline and data quality, we should soon be able to perform queries that answer more of our set questions – then we will identify what needs to be done next to answer the remaining ones!


David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out.  The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.

Open Citations Corpus Import Process

As part of the Open Citations project, we have been asked to review and improve the process of importing data into the Open Citations Corpus, taking the scripts from the initial project as our starting point.

The current import procedure evolved from several disconnected processes and requires running multiple command-line scripts and transforming the data into different intermediate formats. As a consequence, it is not very efficient, and we will be looking to improve the speed and reliability of the import procedure. Moreover, there are two distinct procedures depending on the source of the data (arXiv or PubMed Central); we are hoping to unify the common parts of these procedures into a single process which can be simplified and normalised to improve code re-use and comprehensibility.

The Workflow

As PubMed Central provides an OAI-PMH feed, this could be used to retrieve article metadata and, for some articles, full text. Using this feed rather than an FTP download (as used currently) would allow the metadata import for both arXiv and PubMed Central to follow a near-identical process, as we are already using the OAI-PMH feed for arXiv.

Also, rather than having intermediate databases and information stores, it would be cleaner to import from the information source straight into a datastore. The datastore could then be queried, allowing matches and linking between articles to be performed in situ. The process would therefore become the following (a sketch of the harvesting step appears after the list):

  1. Pull new metadata from arXiv (OAI-PMH) and PubMed Central (OAI-PMH) and insert new records into the Open Citations Corpus datastore
  2. Pull new full-text from arXiv and PubMed Central, extract citations, and match them with article data in the Open Citations server, creating links between these references and the metadata records for the cited articles. Store unmatched citations as nested records in the metadata for each article.
  3. On a scheduled basis (e.g. nightly), review each existing article’s unmatched citations and attempt to match these with existing bibliographic records of other articles.
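
As a rough illustration of step 1, the sketch below uses the Sickle OAI-PMH client for Python to pull recently updated records from either source. The endpoint URLs and metadata prefixes are our working assumptions, and the datastore insertion is left as a stub.

from sickle import Sickle  # third-party OAI-PMH client: pip install sickle

# assumed OAI-PMH endpoints and metadata prefixes for the two sources
SOURCES = {
    "pubmedcentral": ("https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi", "pmc"),
    "arxiv": ("http://export.arxiv.org/oai2", "oai_dc"),
}

def harvest(source, from_date):
    """Yield (identifier, metadata) for records updated since from_date (YYYY-MM-DD)."""
    url, prefix = SOURCES[source]
    client = Sickle(url)
    # 'from' is a reserved word in Python, so it is passed via a dict
    for record in client.ListRecords(metadataPrefix=prefix, **{"from": from_date}):
        yield record.header.identifier, record.metadata

for identifier, metadata in harvest("arxiv", "2013-01-01"):
    print(identifier)  # here we would insert or update the record in the datastore
    break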


The Datastore

Neo4j is currently used as the final Open Citations Corpus datastore for the arXiv data, by the Related Work system. We propose instead to use BibServer as the final datastore, for its flexibility, its scalability, and its suitability for the Open Citations use cases.

The Data Structure

The data stored within BibServer as BibJSON will be a collection of linked bibliographic records describing articles. Associated with each record, and stored as nested data, will be a list of matched citations (i.e. those for which the Open Citations Corpus has a bibliographic record), a list of unmatched citations, and a list of authors. An illustrative sketch of such a record follows.
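
Expressed as a Python literal (the core fields follow BibJSON conventions; the citation field names are our working assumptions, not a fixed schema), a record might look something like this:

article = {
    "title": "An example citing article",
    "author": [{"name": "A. N. Author"}],
    "year": "2012",
    "identifier": [{"type": "doi", "id": "10.1234/example"}],
    # matched citations: the corpus already holds a record for the cited article
    "citation": [
        {"identifier": [{"type": "pmid", "id": "17170002"}]}
    ],
    # unmatched citations: kept as free text until the scheduled matching job resolves them
    "unmatched_citation": [
        {"unstructured": "Smith J (2001). Some cited work. Journal of Examples 1: 1-10."}
    ],
}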

Authors will not be stored as separate entities. De-coupling and de-duplicating authors and articles could form the basis of a future project, perhaps using proprietary identifiers (such as ORCID, PubMed Author ID or arXiv Author ID) or email addresses, but this will not be considered further in this work package.

Overall Aim

The overall aim of this work is to provide a consistent, simple and re-usable import pipeline for data entering the Open Citations Corpus. In the fullness of time we expect it to be possible to add new data sources with minimal additional complexity. By importing data into the datastore at as early a stage as possible in the pipeline, we can use common tools for extracting, matching and deduplicating citations; the work for each data source is then just to convert the source data format into BibJSON and store it in BibServer.


David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out.  The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.

Open Citations – Indexing PubMed Central OA data

As part of our work on the Open Citations extensions project, I have recently been doing one of my favourite things – namely indexing large quantities of data then exploring it.

On this project we are interested in the PubMed Central Open Access subset, and more specifically, we are interested in what we can do with the citation data contained within the records that are in that subset – because, as they are open access, that citation data is public and freely available.

We are building a pipeline that will enable us to easily import data from the PMC OA and from other sources such as arXiv, so that we can do great things with it like explore it in a facetview, manage and edit it in a bibserver, visualise it, and stick it in the rather cool related-work prototype software. We are building on the earlier work of both the original Open Citations project, and of the Open Bibliography projects.

Work done so far

We have spent a few weeks getting to understand the original project software and clarifying some of the goals the project should achieve; we have put together a design for a processing pipeline to get the data from source right through to where we need it, in the shape that we need it. In the case of facetview / bibserver work, this means getting it into a wonderful elasticsearch index.

While Martyn continues work on the bits and pieces for managing the pipeline as a whole and pulling data from arXiv, I have built an automated and threadable toolchain for unpacking data out of the compressed file format in which it arrives from the US National Institutes of Health, parsing the XML file format, converting it into BibJSON, and then bulk loading it into an elasticsearch index. This has gone quite well; a sketch of the bulk-loading step follows.
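
Conceptually, the final step looks something like this minimal sketch using the elasticsearch-py client's bulk helper; the index name and record contents are illustrative only.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(records, index="occ"):
    """Wrap parsed BibJSON records as bulk-index actions."""
    for record in records:
        yield {"_index": index, "_source": record}

# stand-in for the BibJSON records produced by the XML-parsing stage
records = [{"title": "An example article", "year": "2012"}]
helpers.bulk(es, actions(records))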

To fully browse what we have so far, check out http://occ.cottagelabs.com.

For the code: https://github.com/opencitations/OpenCitationsCorpus/tree/master/pipeline.

The indexing process

Whilst the toolchain is capable of running threaded, the server we are using has only 2 cores and I was not sure to what extent they would be utilised, so I ran the process single-threaded. It took five hours and ten minutes to build an index of the PMC OA subset, and we now have over 500,000 records. We can full-text search them and facet-browse them.

Some things of particular interest that I learnt: I have an article in the PMC OA! Also, PMIDs are not always 8 digits long – they appear in fact to be incremental from 1.

What next

At the moment no effort is made to create record objects for the citations we find within these records; however, plugging that into the toolchain is now relatively straightforward.

The full pipeline is of course still in progress, and so this work will need a wee bit of wiring into it.

Improve parsing. There are probably improvements we can make to the parsing too, so one of the next tasks will be to look at a few choice records and decide how better to parse them. The best way to get a look at the records for now is to use a browser like Firefox or Chrome with the JSONview plugin installed, go to occ.cottagelabs.com, have a bit of a search, then click the small blue arrows at the start of a record you are interested in to see it in full JSON straight from the index. Some further analysis of a few of these records would be a great next step, and should allow for improvements both to the data we can parse and to our representation of it.

Finish visualisations. Now that we have a good test dataset to work with, the various bits and pieces of visualisation work will be pulled together and put on display somewhere soon. These, in addition to the search functionality already available, will enable us to answer the questions set as representative of project goals earlier in January (thanks, David, for those).


David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out.  The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.
