Early adopters of the OpenCitations Data Model

OpenCitations is very pleased to announce its collaboration with four new scholarly Research and Development projects that are early adopters of the recently updated OpenCitations Data Model, described in this blog post.

The four projects are similar in that each is independently using text mining, together with optical character recognition or PDF extraction techniques, to extract citation information from the reference lists of published works, and each is making these citations available as Linked Open Data. Three of the four will also use the OpenCitations Corpus as the publication platform for their citation data.  The academic disciplines from which these citation data are being extracted are the social sciences, the humanities and economics.

1     Linked Open Citation Database (LOC-DB)

The Linked Open Citation Database, with partners in Mannheim, Stuttgart, Kiel, and Kaiserslautern (LOC-DB, https://locdb.bib.uni-mannheim.de/blog/en/), is the first of two German projects funded by the Deutsche Forschungsgemeinschaft (DFG) that are extracting citations from Social Science publications.  Dr. Annette Klein, Deputy Director of the Mannheim University Library, is the project manager.

The project is using deep neural network-based approaches for reference detection, and state-of-the-art methods for information extraction and semantic labelling of reference lists from electronic and print media with arbitrary layouts [3].  The raw data obtained will be manually checked against, and linked with, existing bibliographic metadata sources in an editorial system.  The checked data will then be structured in RDF using the OpenCitations Data Model, and published in the Linked Open Citation Database under a CC0 waiver. Using its libraries’ own Social Science print holdings and licensed electronic journals as subject material, this project will demonstrate how these citation extraction processes can be applied to the holdings of individual academic libraries, and can be integrated with library catalogues [1, 2, 3].


[1]       Kai Eckert, Anne Lauscher and Akansha Bhardwaj (2017) LOC-DB: A Linked Open Citation Database provided by Libraries. Motivation and Challenges.  EXCITE Workshop 2017: “Challenges in Extracting and Managing References”.  https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2016/11/LOC-DB@EXCITE.pdf

[2]       Anne Lauscher, Kai Eckert, Lukas Galke, Ansgar Scherp, Syed Tahseen Raza Rizvi, Sheraz Ahmed, Andreas Dengel, Philipp Zumstein, Annette Klein (2018). Linked Open Citation Database: How much would it cost if libraries catalogued and curated the citation graph? (working title) Accepted for the JCDL 2018: Joint Conference on Digital Libraries 2018, June 3-6, 2018 in Fort Worth, Texas. https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2018/03/LOCDB-JCDL2018-paper.pdf [Preprint of the conference publication]

[3]       Bhardwaj A., Mercier D., Dengel A., Ahmed S. (2017). DeepBIBX: deep learning for image based bibliographic data extraction. In: Liu D., Xie S., Li Y., Zhao D., El-Alfy ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10635. Springer, Cham [Conference publication].

2     The EXCITE (Extraction of Citations from PDF Documents) Project

The EXCITE Project (http://west.uni-koblenz.de/en/research/excite/), run jointly at the University of Koblenz-Landau and GESIS (Leibniz Institute for Social Sciences), is the second project funded by the Deutsche Forschungsgemeinschaft (DFG) that is extracting citations from Social Science publications.  It is headed by Steffen Staab, head of the Institute for Web Science and Technologies at the University of Koblenz-Landau, and Philipp Mayr of GESIS.

Since the social sciences are given only marginal coverage in the main bibliographic databases, this project aims to make more citation data available to researchers, with a particular focus on the German-language social sciences.  It has developed a set of algorithms for extracting reference information from PDF documents and for matching the reference entry strings thus obtained against bibliographic databases (see the EXCITE git repositories at https://github.com/exciteproject/).  It is using as its data sources the following Social Science collections: full texts from SSOAR, the GESIS Social Science Open Access Repository (https://www.gesis.org/ssoar/home/), and scattered PDF stocks from other social science collections including SOLIS, Springer Online Journals and CSA Sociological Abstracts [4, 5].

The EXCITE project organized an international developer and researcher workshop, “Challenges in Extracting and Managing References”, in March 2017 in Cologne (http://west.uni-koblenz.de/en/research/excite/workshop-2017).

EXCITE will structure the extracted bibliographic and citation data in RDF using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, employing the OCC EXCITE supplier prefix 0110, described here, to identify the provenance of these citations.
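
As a rough illustration of how a supplier prefix can carry provenance, the following Python sketch looks up the project behind an identifier. The mapping of prefixes to projects is taken from this post; the identifier layout (a short type code, a slash, then the prefix followed by a counter) and the function names are assumptions made for the example, and should not be read as the definitive OpenCitations specification.

```python
from typing import Optional

# Supplier prefixes named in this post (prefix -> project).
SUPPLIER_PREFIXES = {
    "0110": "EXCITE",
    "0120": "Venice Scholar Index",
    "0140": "CitEcCyr",
}


def mint_id(kind: str, prefix: str, counter: int) -> str:
    """Build a hypothetical local identifier such as 'br/011042'.

    The '<kind>/<prefix><counter>' layout is an assumption for this sketch.
    """
    if prefix not in SUPPLIER_PREFIXES:
        raise ValueError(f"unknown supplier prefix: {prefix}")
    return f"{kind}/{prefix}{counter}"


def supplier_of(identifier: str) -> Optional[str]:
    """Return the supplying project for an identifier, or None if unprefixed."""
    local_part = identifier.split("/", 1)[-1]
    for prefix, project in SUPPLIER_PREFIXES.items():
        if local_part.startswith(prefix):
            return project
    return None


if __name__ == "__main__":
    example = mint_id("br", "0110", 42)
    print(example, "->", supplier_of(example))
```

Because the prefix travels with the identifier itself, the supplying project can be recovered from the identifier alone, without consulting separate provenance records.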


[4]       Martin Körner (2016). Extraction from social science research papers using conditional random fields and distant supervision, Master’s Thesis, University of Koblenz-Landau, 2016.

[5]       Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., & Staab, S. (2017). Evaluating reference string extraction using line-based conditional random fields: a case study with German language publications. In M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, & S. Rizzi (Eds.), New Trends in Databases and Information Systems (Vol. 767, pp. 137–145). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_15   Preprint: https://philippmayr.github.io/papers/Koerner-et-al2017.pdf

3    The Venice Scholar Index

The Venice Scholar Index is a citation index of literature on the history of Venice, indexing nearly 3000 volumes of scholarship from the mid-19th century to 2013, from which some 4 million bibliographic references have been extracted.

The Venice Scholar Index is the first prototype resulting from the Linked Books Project (https://dhlab.epfl.ch/page-127959-en.html), a project spearheaded by Giovanni Colavizza and Matteo Romanello of the Digital Humanities Laboratory at EPFL (École Polytechnique Fédérale de Lausanne), with partners in Venice, Milan and Rome.

The project is exploring the history of Venice through references to scholarly literature as well as archival documents found within publications.  To achieve this goal, the project has developed a system to automatically extract bibliographic references found within a large set of digitized books and journals, which has then been applied to the publications on the history of Venice, its main use case [6].

The Linked Books Project is specifically interested in analysing the interplay between citations to primary (e.g. archival) documents and those to secondary sources (scholarly literature), and the citation profiles of publications through time.  To this end, it developed the Venice Scholar Index, a rich search interface to navigate through the resulting network of citations, with the final aim of interlinking digital archives and digital libraries.

The citation data underlying the Venice Scholar Index are modelled using the OpenCitations Data Model. The project will use the OpenCitations Corpus as its publication platform, using the OCC Venice Scholar Index supplier prefix 0120 to identify the provenance of these citations.


[6] Giovanni Colavizza, Matteo Romanello, and Frédéric Kaplan (2017). The references of references: a method to enrich humanities library catalogs with citation data. In International Journal on Digital Libraries 18 (March 8, 2017): 1–11. https://doi.org/10.1007/s00799-017-0210-1.

4    CitEcCyr – Citations in Economics published in Cyrillic

CitEcCyr (https://github.com/citeccyr/CitEcCyr) is an open repository of citation relationships obtained from research papers in the Russian language and Cyrillic script from Socionet (https://socionet.ru/) and RePEc (http://repec.org/) [7, 8].  The CitEcCyr project is headed by Oxana Medvedeva, is technically led by Sergey Parinov, and is funded by RANEPA (http://www.ranepa.ru/eng/), the Russian Presidential Academy of National Economy and Public Administration. CitEcCyr is also developing a suite of open software for the citation content analysis of these papers.  This project intends to model its citations using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, using the OCC CitEcCyr supplier prefix 0140 to identify the provenance of these citations.

However, since this is the first project from which OpenCitations will be importing bibliographic metadata and citations in a language other than English and in a script other than the Latin script, we at OpenCitations are going to have to crawl out of our comfortable ‘Western’ shells and learn to handle other languages and non-Latin scripts.

For these Russian language papers, we at OpenCitations will need to decide how best to handle three forms of text: Russian written in Cyrillic script, Cyrillic script transliterated into Latin script, and Russian translated into English and rendered in Latin script.  In particular, since in the OpenCitations Corpus our reference entry records are the uncorrected literal texts of the references in the reference lists of the citing papers, these will need to be recorded as given in Cyrillic.
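
One way to think about these three forms is as separate fields on a single record, with the literal text never overwritten. The Python sketch below is only an illustration under that assumption: the `ReferenceEntry` class, its field names and the `dominant_script` helper are hypothetical and are not part of the OpenCitations Data Model, and the sample string is made up.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


def dominant_script(text: str) -> str:
    """Guess whether a string is mostly Cyrillic or mostly Latin.

    Uses Unicode character names, which begin with the script name
    (for example 'CYRILLIC CAPITAL LETTER PE').
    """
    counts = {"CYRILLIC": 0, "LATIN": 0}
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in counts:
                if name.startswith(script):
                    counts[script] += 1
    if counts["CYRILLIC"] == 0 and counts["LATIN"] == 0:
        return "unknown"
    return "Cyrillic" if counts["CYRILLIC"] >= counts["LATIN"] else "Latin"


@dataclass
class ReferenceEntry:
    """Hypothetical record: literal text kept as given, other forms optional."""
    literal_text: str                          # uncorrected text, as in the citing paper
    script: str                                # detected script of the literal text
    transliteration: Optional[str] = None      # Latin-script transliteration, if supplied
    english_translation: Optional[str] = None  # English translation, if supplied


def make_entry(literal_text: str) -> ReferenceEntry:
    """Record the literal text unchanged and note which script it uses."""
    return ReferenceEntry(literal_text, dominant_script(literal_text))


if __name__ == "__main__":
    entry = make_entry("Пример ссылки")  # made-up sample, means "example reference"
    print(entry.script, "|", entry.literal_text)
```

Keeping the transliteration and translation optional means a record remains valid when a data supplier provides only the Cyrillic text.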

We will need to develop a policy for when to provide Latin script translations of (for example) titles and abstracts, if these are not provided by the data supplier.  To facilitate use of the OpenCitations Corpus by Russian scholars, we will also need to modify the OpenCitations web site, so as to render the static information displayed in the web pages in the language and script appropriate to the language setting on the user’s web browser.
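
Browsers announce their language setting in the HTTP Accept-Language request header, so choosing a display language can be sketched roughly as follows. This is a minimal example and not a description of how the OpenCitations web site works: the function names, the supported set ("en" and "ru") and the English default are all assumptions.

```python
from typing import List, Sequence, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept-Language header into (tag, weight) pairs, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        weight = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat an unparsable weight as "not wanted"
        prefs.append((tag.strip().lower(), weight))
    # sorted() is stable, so equal weights keep the order given in the header.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_language(header: str,
                    supported: Sequence[str] = ("en", "ru"),
                    default: str = "en") -> str:
    """Pick the first supported primary language, falling back to a default."""
    for tag, weight in parse_accept_language(header):
        if weight <= 0:
            continue
        primary = tag.split("-")[0]  # 'ru-RU' -> 'ru'
        if primary in supported:
            return primary
    return default


if __name__ == "__main__":
    print(choose_language("ru-RU,ru;q=0.9,en;q=0.8"))
```

Only the static page text would be switched in this way; the citation data themselves would be shown as stored.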

Unfortunately, all this will take time, so we do not anticipate publishing citation data from the CitEcCyr project within OCC any time soon.  However, this collaboration will be of tremendous value to OpenCitations as well as to CitEcCyr, since the lessons learned from it will enable the OpenCitations Corpus to handle citation data not just in Russian, but also in Arabic, Chinese, Japanese and other languages where the Latin script is not used, something that is not found in other major bibliographic databases.

Watch this space!


[7]       Jose Manuel Barrueco, Thomas Krichel, Sergey Parinov, Victor Lyapunov, Oxana Medvedeva and Varvara Sergeeva (2017).  Towards open data for the citation content analysis.    https://arxiv.org/abs/1710.00302

[8]       Thomas Krichel (2017). CitEc to CitEcCyr – A stab at distributed citation systems.  Presented at the 2017 EXCITE workshop. http://west.uni-koblenz.de/sites/default/files/research/projects/excite/workshop-2017/slides/excite-workshop-2017_krichel_citec-to-citeccyr.pdf
