The academic discovery space seems to be buzzing again. It had become relatively stable after the introduction and maturing of Web Scale Discovery between 2009 and 2013, but things seem to be hotting up once again.
With the recent interest in integrating discovery of open access content, as well as linked data (with a dash of machine learning and text mining), we have the beginnings of an interesting situation. A third development, which was harder to foresee, is the rise of the Open Citations movement, which I will focus on in this post.
How did this movement begin? How does the size of the open citations pool compare with gold standards like Scopus and Web of Science? Who are the players that use it (e.g. Digital Science’s Dimensions), and how might it develop in the future?
In recent years, we have seen the launch of many innovative discovery search engines, such as
But in terms of citation indexes, we have seen the official launch of two new comprehensive citation indexes, Microsoft’s Academic and Digital Science’s Dimensions, which take their places alongside the big three: Web of Science, Scopus and Google Scholar.
Microsoft’s Academic was in beta for 2 years before finally launching officially in late 2017. I covered the preview version in a blog post in May 2017. An interesting thing to note is that unlike its more famous rival Google Scholar, Microsoft’s data is somewhat open and is available via the Academic Knowledge API.
But let’s focus on the latest one to join the fray, Digital Science’s Dimensions.
Digital Science’s Dimensions
I’ll do a full review in a later post, but the bit I want to focus on is that Dimensions uses open citation data from the Initiative for Open Citations (I4OC) as well as from other sources like ORCID, oaDOI and GRID.
Thanks to @i4oc_org for making things like @DSDimensions possible! Really exciting to be involved with such a fantastic initiative #highered
— Dimensions (@DSDimensions) January 19, 2018
Open Citations? What manner of beast is that?
If you haven’t been keeping track of this development, this post is for you.
Open Citations Corpus
While many are aware of the push for Open Access, I suspect fewer are aware of the push for open citations. This is a call put out by The Initiative for Open Citations (I4OC), to make citations open.
But let’s take a step back.
For a long time, the only way to get citation data was via paid citation indexes – either via Clarivate’s Web of Science or Elsevier’s Scopus.
But this changed fairly recently. For example, consider the OpenCitations Corpus (OCC), a publisher of open citations.
OpenCitations Corpus (OCC) is “an open repository of scholarly citation data made available under a Creative Commons public domain dedication, which provides in RDF accurate citation information (bibliographic references) harvested from the scholarly literature.”
But how big is it?
“As of January 20, 2018, the OCC has ingested the references from 302,758 citing bibliographic resources and contains information about 12,830,347 citation links to 6,549,665 cited resources.”
This seems decent, but you might be wondering, where did all the open citations come from?
Open Citations and its relationship to Crossref
First off, most of the open citations available, particularly for journal articles and book chapters, actually come from Crossref.
You might find this surprising, but when publishers submit their article metadata (e.g. title, author, journal) to Crossref for DOI registration many of them (around 1/3 of publishers including most of the big ones) also choose to submit references of articles in their journals.
Why would publishers do that? Because doing so gives them access to Crossref’s cited-by service, which has been in operation since 2007.
So what does it do? It is a Crossref service that helps publishers check which items are citing their articles.
Publishers who are allowed to use the cited-by service can use an API to retrieve information for displaying cites on papers on their website.
They can see not just the total number of cites from other items in Crossref but also the actual references. Do note that by default, they can only see cites to their own items and not cites to other publishers’ items.
This cited-by service is offered free to Crossref publishers by Crossref but there is one obligation.
To use this service, besides depositing the usual article metadata (title, author etc) into Crossref, they will also need to deposit the references of the articles.
Do note that a publisher choosing to deposit references into Crossref isn’t sufficient to make them open to everyone. It merely gives the publisher access to the cited-by service.
While anyone can access the citation counts via the usual Crossref API, the citations themselves need to be explicitly made open by the publisher depositing the references.
The reference distribution policy by Crossref dated Jan 2018 allows publishers to set their references to one of the following levels.
- Open – anyone can access citations via standard Crossref API
- Limited – only accessible via new paid Crossref Metadata API plus
- Closed – Not usable by anyone. Used only in the cited-by service, i.e. only the publisher of the item that was cited will see the citation.
It is this subset of deposited references, the ones publishers have made open, that makes up the bulk of the citations in the OpenCitations Corpus (OCC).
This is where the Initiative for Open Citations (I4OC), newly formed in 2017, comes in.
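As a rough illustration of what “open” means in practice, here is a minimal Python sketch that reads a work record in the shape returned by the public Crossref REST API (`https://api.crossref.org/works/{doi}`). The field names `is-referenced-by-count` and `reference` are the ones I understand the API to use. The helper names, the sample record and its DOIs are made up for illustration, and I am assuming that the `reference` list is simply left out of the record when a publisher has not made its references open.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"  # public Crossref REST API


def fetch_work(doi: str) -> dict:
    """Fetch one work record from Crossref (needs network access)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["message"]


def summarize_citations(work: dict) -> dict:
    """Pull the citation-related fields out of a Crossref work record."""
    refs = work.get("reference")  # assumed absent when references are not open
    return {
        "cited_by_count": work.get("is-referenced-by-count", 0),
        "references_open": refs is not None,
        # Not every deposited reference has been matched to a DOI.
        "cited_dois": [r["DOI"] for r in refs or [] if "DOI" in r],
    }


# Hypothetical record, trimmed to the fields used above.
sample = {
    "DOI": "10.1000/example.1",
    "is-referenced-by-count": 12,
    "reference": [
        {"key": "ref1", "DOI": "10.1000/example.a"},
        {"key": "ref2", "unstructured": "A reference without a matched DOI"},
    ],
}

print(summarize_citations(sample))
```

To try it against live data, `summarize_citations(fetch_work(doi))` should work for any DOI registered with Crossref, though what comes back depends on what the publisher has deposited and opened.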
Impact of I4OC on open citations
I4OC has achieved great success in encouraging publishers to make the references they submit to Crossref open. As of Jan 2018, publishers have opened up the references for “more than 50% out of 38 million articles with references deposited with Crossref.”
When the initiative first started, the figure was 1%.
The list of major publishers that have deposited references and made their citations open is impressive. Most of the big publishers, such as Springer Nature, Taylor and Francis, Wiley and Sage, are already doing this. See list of publishers here.
How significant is this achievement relatively speaking?
First, notice that the 50% open citations figure above refers to 50% of “articles with references that are deposited in Crossref” and this excludes articles that do not have references deposited.
How do things look after we take that into account?
For non-journal items (mostly book chapters) only 20.4% are deposited with references.
How do the citations in Crossref (both open and non-open) compare with Scopus and Web of Science?
While the above analysis is interesting, the traditional gold standards for citation indexes are Web of Science and Scopus. How do references deposited into Crossref compare?
The upshot is that around 39.7% of references in Web of Science match an open reference, and this figure is 34.8% for Scopus.
If all references in Crossref were included (both closed and open), these figures would rise to 77.1% and 69.1% respectively. This isn’t too bad, particularly since the authors note that due to difficulties in matching DOIs, these figures are a lower bound on the actual coverage.
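To make the gap explicit, here is a small back-of-the-envelope calculation in Python using only the four percentages quoted above. The variable and function names are my own, and “non-open” here lumps together whatever is deposited in Crossref but not made open.

```python
# Share of references in each index that match a Crossref reference,
# as quoted above (in percent).
coverage = {
    "Web of Science": {"open": 39.7, "all": 77.1},
    "Scopus": {"open": 34.8, "all": 69.1},
}


def non_open_share(figures: dict) -> float:
    """Percentage points matched only by references that are not open."""
    return round(figures["all"] - figures["open"], 1)


for index, figures in coverage.items():
    print(f"{index}: {non_open_share(figures)} points come from non-open references")
```

In other words, for both indexes roughly half of the matchable references are already sitting in Crossref but are not open.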
Improving coverage of Open Citations – 2 ways
There are two ways to improve coverage of open citations. Firstly, get publishers who already deposit references to Crossref but keep them closed (see list here) to make them open. Secondly, get publishers who are not depositing references at all to start doing so.
“Elsevier references dominate those that are not open at Crossref”
So who are the major hold-outs? There are a few but the major culprit here appears to be Elsevier.
In a post entitled “Elsevier references dominate those that are not open at Crossref”, the authors find that of the 470 million references in journal articles deposited in Crossref that are not made open, an impressive 65.1% are from Elsevier articles!
This implies that Elsevier has relatively few missing citations (needed to match Scopus) not already deposited in Crossref.
Ludo Waltman, one of the authors of the CWTS paper, agrees.
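For a sense of scale, the 65.1% figure can be turned into absolute numbers with a couple of lines of Python. This only uses the two figures quoted from the post; the variable names are mine.

```python
NON_OPEN_REFERENCES = 470_000_000  # deposited in Crossref but not open
ELSEVIER_SHARE_PER_MILLE = 651     # 65.1%, kept as an integer to avoid float noise

elsevier_refs = NON_OPEN_REFERENCES * ELSEVIER_SHARE_PER_MILLE // 1000
other_refs = NON_OPEN_REFERENCES - elsevier_refs

print(f"Elsevier: about {elsevier_refs / 1e6:.0f} million non-open references")
print(f"Everyone else: about {other_refs / 1e6:.0f} million")
```

That works out to roughly 306 million references from Elsevier articles alone, against about 164 million from all other publishers combined.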
Indeed, Elsevier is carefully depositing its references in @CrossrefOrg, but it does not make references openly available; Springer Nature does make references openly available, but a large number of references in books have not been deposited in @CrossrefOrg at all
— Ludo Waltman (@LudoWaltman) January 20, 2018
How Open Citations are currently used
I have mentioned VOSviewer a couple of times in the past. As mentioned earlier, VOSviewer works with the Crossref API, so the more citations are made open, the richer the information users will see.
Open citation data can also be ported into Wikidata, so one can run SPARQL queries like “Top cited female researchers in Denmark”, or create citation graphs of articles or people.
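To give a flavour of the citation graph idea, here is a minimal Python sketch that builds a graph from (citing, cited) pairs and counts how often each item is cited. The paper identifiers are invented; with real data they would be DOIs or Wikidata item IDs, and this is not how any particular tool does it.

```python
from collections import Counter, defaultdict

# Hypothetical (citing, cited) pairs standing in for open citation links.
citation_links = [
    ("paper-A", "paper-C"),
    ("paper-B", "paper-C"),
    ("paper-B", "paper-D"),
    ("paper-C", "paper-D"),
    ("paper-E", "paper-C"),
]


def build_graph(links):
    """Map each citing item to the set of items it cites."""
    graph = defaultdict(set)
    for citing, cited in links:
        graph[citing].add(cited)
    return graph


def cited_by_counts(graph):
    """Count incoming citations for every cited item."""
    return Counter(cited for refs in graph.values() for cited in refs)


graph = build_graph(citation_links)
counts = cited_by_counts(graph)
for paper, count in counts.most_common():
    print(paper, count)
```

The same counting step is essentially what a “top cited” query does, just restricted to whichever authors or institutions you filter on.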
But by far, I think the main way people are going to access data based on Open Citations will be via Digital Science’s Dimensions.
Digital Science Dimensions and its use of Open Citations
I won’t write a lot about the background of Dimensions. Roger Schonfeld has a good piece breaking the news about it, and I will be reviewing it soon.
But for the purposes of this piece, the most significant thing about Dimensions is that the data is at least partly based on open citations from I4OC.
I’m pretty sure Dimensions goes beyond that, as it combines input and expertise from 6 different teams, including ReadCube, Altmetric, Figshare, Symplectic, DS Consultancy and ÜberResearch, as well as other publisher partners.
It currently boasts 89 million publications and 870 million citations, which I believe is substantially beyond the number of open citations in Crossref.
When I enquired how much more was in Dimensions compared with what is available via OpenCitations, Dimensions had this to say.
@aarontay great question – quick answer: in addition to I4OC Dimensions is built on improving discoverability of +50 million records by processing the full-text – not only references but also acknowledgements. Some of them are part of I4OC data, some not. #moretofollow #takestime https://t.co/NMrynOBBq2
— Dimensions (@DSDimensions) January 20, 2018
On a side note, Dimensions is taking an inclusive approach, so it appears to have as many items as Scopus, if not more, though its number of citations is currently still substantially lower than Scopus’s.
However, one wonders if there is an Elsevier-sized hole in the citation data in Dimensions, given that those references are not made open. Are the additional layers that Digital Science builds on top of open citations sufficient to fill this gap? Interesting questions to ponder.
Another point to consider: while you can access the citation data from Dimensions (including the open citations) for free, there are limits to what you can do.
Bianca Kramer, a leading librarian in scholarly communication, makes a distinction between products making use of open data and those that are truly open.
In a comment on Roger Schonfeld’s piece on Dimensions, she writes:
“In practice though, Dimensions, while perhaps partly building on publicly available data (e.g. from oaDOI), is not contributing to it. The freely accessible version of Dimensions might be very useful for certain purposes, but it doesn’t allow access, export and (re)use of the underlying (meta)data, thereby remaining a commercial party’s closed silo. This is very different from building on open data and, as one business model, charging for the value of all (in a paid model) or some (in a freemium model) of these functionalities, while ensuring that the underlying data are and will remain publicly available. Then citation data would also no longer be a commodity, but truly a public good.”
As the pool of open citations grows, more and more services will sprout up to exploit the data. It will be interesting to see what business models these new services adopt.
I hope this tour of open citations, their scale compared to other citation indexes and how they are used has been useful.
It also seems a new rivalry might be brewing. The library world has long witnessed the struggle between ProQuest and EBSCO in the library discovery space. Each is both a library discovery provider (with a central index) and the owner of a portfolio of content. This has famously led to stand-offs where both sides refused to share metadata and full text with each other’s central index, and to poor or entirely absent integration between the products and services (e.g. link resolvers and library management systems such as Alma and Folio) that belong to their respective stables.
Roger Schonfeld proposes that a similar lock-in situation, with perhaps even more far-reaching consequences around researcher workflow, might be emerging: a duopoly with Elsevier on one side with its stable of services and potentially Digital Science (possibly with the aid of co-owned Springer Nature) on the other. Dimensions vs Scopus could just be the first salvo in a long battle ahead.