Requirements for citations to be treated as First-Class Data Entities
In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities. The second of these requirements is that they must have metadata structured using a generic yet appropriately detailed data model.
To fulfil that requirement, OpenCitations is pleased to announce the publication on 13 February 2018 of the OpenCitations Data Model, v1.6. This replaces the previous version, v1.5.3, published on 13 July 2016.
The data model has been expanded and enhanced to improve the recording of publication dates, to include the treatment of citations as first-class data entities, and to permit the model’s adoption by third parties who may wish to use it to model their own citation data, or to prepare their citation data for publication in the OpenCitations Corpus (OCC). To facilitate this, the document describing this data model is published under a Creative Commons Attribution 4.0 International license.
In addition to a change in the title from “Metadata for the OpenCitations Corpus” to “The OpenCitations Data Model”, and the use of the name “OpenCitations” (one token with two words in camel case) in place of “Open Citations” (with the space separating the two words), the substantive changes in the model from the previous version are as follows:
A new class, Archival document, has been added as a subclass of bibliographic resource, to permit the model to be used for work on ancient manuscripts.
The mechanism for recording the publication dates of bibliographic resources has been improved, and now accepts the full date of publication (yyyy-mm-dd, if available), or the year plus the month of publication (yyyy-mm, if the full date is not available), or failing that just the year of publication (yyyy, as in the previous version of the data model). In order to support this modification in the OWL mapping, prism:publicationDate is now used instead of fabio:hasPublicationYear.
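The fallback order for publication dates can be illustrated with a short Python sketch. It is not taken from the Data Model document: the function name `publication_date_granularity` and the labels it returns are hypothetical, chosen only to show how a date string might be checked against the three accepted forms.

```python
import re
from datetime import datetime

# The three accepted forms, from most to least precise.
# Each entry: (label, expected shape, strptime format used to validate the values).
_FORMATS = [
    ("full date", r"\d{4}-\d{2}-\d{2}", "%Y-%m-%d"),
    ("year and month", r"\d{4}-\d{2}", "%Y-%m"),
    ("year", r"\d{4}", "%Y"),
]


def publication_date_granularity(value: str) -> str:
    """Return which of the three forms `value` uses, or raise ValueError."""
    for label, shape, fmt in _FORMATS:
        if re.fullmatch(shape, value):
            # Rejects impossible values such as month 13 or 30 February.
            datetime.strptime(value, fmt)
            return label
    raise ValueError(f"not a yyyy, yyyy-mm or yyyy-mm-dd date: {value!r}")


for example in ("2018-02-13", "2018-02", "2018"):
    print(example, "->", publication_date_granularity(example))
```

A data supplier could run a check of this kind before mapping the value to prism:publicationDate.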
Citations as first-class data entities
A new class of bibliographic entity, Citation, has been added to permit the description of citations as first-class data entities. This class has been assigned sub-classes (e.g. Author self-citation) and properties (e.g. citation time span) to permit the description of citations in a manner helpful for bibliometric analysis. These, and associated changes to CiTO, the Citation Typing Ontology, are described more fully in the previous blog post.
The OpenCitations Data Model now permits the definition of virtual entities, i.e. bibliographic entities that are defined on-the-fly, only when they are requested (for example, by accessing their URLs). These are defined either by using data relating to non-virtual bibliographic entities that are already available within the OCC, or by using data that are themselves obtained on-the-fly from an external supplier (e.g. Wikidata).
This approach of using virtual RDF resources is optional, and is simply employed for storage efficiency, to avoid duplication of information within the OCC triplestore. As of January 2018, only one type of bibliographic entity is defined as a virtual entity, namely a citation (a member of the class Citation).
Such a virtual entity does not have the full provenance information normally associated with other bibliographic entities within the OCC, but it does record the date of its creation, together with direct links both to the agent responsible for its creation and to the source data used in its construction.
Because we do not separately store these virtual entities within the Corpus triplestore, they cannot be directly queried by means of the OCC SPARQL end-point, nor are they included in its data dumps. However, the data associated with an OCC virtual entity can be obtained by accessing its URL, which has the form “https://w3id.org/oc/virtual/xyz”, clearly distinguishable from the URIs used for other (non-virtual) OCC bibliographic entities, which have the form “https://w3id.org/oc/corpus/xyz”. More details and examples are given in the Data Model document itself.
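Since the two kinds of entity differ only in the base of their URL, a client can tell them apart with a simple prefix test. The following Python sketch assumes nothing beyond the two URL forms quoted above; the function name `is_virtual_entity` is hypothetical, and “xyz” stands in for a real identifier exactly as it does in the text.

```python
# Base URLs taken from the two forms quoted in the post.
VIRTUAL_BASE = "https://w3id.org/oc/virtual/"
CORPUS_BASE = "https://w3id.org/oc/corpus/"


def is_virtual_entity(url: str) -> bool:
    """Return True for a virtual OCC entity URL, False for a stored (corpus) one."""
    if url.startswith(VIRTUAL_BASE):
        return True
    if url.startswith(CORPUS_BASE):
        return False
    raise ValueError(f"not an OCC entity URL: {url!r}")


# "xyz" is a placeholder, not a real identifier.
print(is_virtual_entity("https://w3id.org/oc/virtual/xyz"))
print(is_virtual_entity("https://w3id.org/oc/corpus/xyz"))
```

A tool could use such a test to decide whether an entity can be looked up through the SPARQL end-point or must be fetched from its URL.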
Additionally, for citations defined using Open Citation Identifiers (OCIs, described in a subsequent blog post), details of the cited and citing publications may be readily obtained by using the Open Citation Identifier Resolution Service at http://opencitations.net/oci.
To enable citation data created by third parties to be incorporated within the OpenCitations Corpus, from February 2018 the OCC local identifiers for bibliographic resources now include a supplier prefix which clearly identifies the provenance of the data. The prefix consists of a positive number (following the pattern “nnn”, where “nnn” is a string of numerals of variable length which includes no zeros), enclosed between two zeros (e.g. “0420”).
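The shape of a supplier prefix can be expressed as a regular expression: a zero, one or more non-zero digits, and a closing zero. The Python sketch below is only an illustration of that rule; the names `is_supplier_prefix` and `make_supplier_prefix` are hypothetical, and the code checks the shape of a prefix only, not whether OpenCitations has actually assigned it.

```python
import re

# A zero, then one or more digits from 1 to 9, then a closing zero.
SUPPLIER_PREFIX = re.compile(r"0[1-9]+0")


def is_supplier_prefix(text: str) -> bool:
    """Check that `text` has the shape of a supplier prefix, e.g. "0420"."""
    return SUPPLIER_PREFIX.fullmatch(text) is not None


def make_supplier_prefix(number: int) -> str:
    """Wrap a positive number containing no zeros between two zeros."""
    digits = str(number)
    if number <= 0 or "0" in digits:
        raise ValueError(f"prefix number must be positive with no zeros: {number}")
    return f"0{digits}0"


print(make_supplier_prefix(42))    # 0420
print(is_supplier_prefix("0420"))  # True
print(is_supplier_prefix("0100"))  # False: the inner number contains a zero
```

Excluding zeros from the inner number is what makes the prefix self-delimiting: reading from the left, the first zero after the opening one always marks the end of the prefix.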
To ensure uniqueness of prefixes used by different suppliers, all organizations wishing to adopt the OpenCitations Data Model and to use it to create publicly available citation data, whether these are published in the OpenCitations Corpus or independently, must apply to OpenCitations for a unique supplier prefix, by sending an email to firstname.lastname@example.org. A list of already assigned supplier prefixes is available at https://github.com/opencitations/oci/blob/master/suppliers.csv.
The appropriate supplier prefix is combined with a unique numerical string that forms the ‘body’ of the identifier to create the local identifier used in OCC to identify an individual bibliographic resource. OCC local identifiers for citations (as opposed to bibliographic resources) are constructed by combining the local identifiers for the citing and cited bibliographic resources, separating them with a dash. Thus, for a citation between two bibliographic resources described in an external bibliographic database, where each is identified by an identifier having a unique numerical part, that numerical part serves as the body of each resource’s local identifier, and the two resulting local identifiers are joined with a dash.
For example, the citation between citing Wikidata resource Q27931310 and cited Wikidata resource Q22252312 is given the OCC local citation identifier “01027931310-01022252312”, where “010” is the OCC supplier prefix (defined above) for Wikidata. How these OCC local identifiers for citations are used to create Open Citation Identifiers is described in a separate blog post.
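The construction of this identifier can be reproduced in a few lines of Python. This is a minimal sketch rather than the code used by OpenCitations: the function names are hypothetical, and the only facts taken from the text are the Wikidata supplier prefix “010”, the use of the numerical part of each Wikidata identifier as the body, and the dash separator.

```python
WIKIDATA_PREFIX = "010"  # OCC supplier prefix for Wikidata, as stated in the post


def local_id(supplier_prefix: str, body: str) -> str:
    """Combine a supplier prefix with the numerical body of an identifier."""
    if not (body.isascii() and body.isdigit()):
        raise ValueError(f"identifier body must be numeric: {body!r}")
    return supplier_prefix + body


def wikidata_local_id(qid: str) -> str:
    """Build a local identifier from a Wikidata id by dropping the leading 'Q'."""
    if not qid.startswith("Q"):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return local_id(WIKIDATA_PREFIX, qid[1:])


def citation_local_id(citing_id: str, cited_id: str) -> str:
    """Join the citing and cited local identifiers with a dash."""
    return f"{citing_id}-{cited_id}"


citing = wikidata_local_id("Q27931310")
cited = wikidata_local_id("Q22252312")
print(citation_local_id(citing, cited))  # 01027931310-01022252312
```

The order matters: the citing resource always comes first, so swapping the arguments would describe a different citation.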
We commend the OpenCitations Data Model to anyone considering the storage of citation information, particularly if it is to be encoded in RDF, and we welcome contributions of citation data encoded using this model for publication within the OpenCitations Corpus.
Silvio Peroni, David Shotton (2018). The OpenCitations Data Model. Version 1.6. figshare. https://doi.org/10.6084/m9.figshare.3443876