How we’re building the ‘mountain chalet’ of complex conversions

When scaling great heights, sometimes you need a place to rest before moving on.

That’s one analogy for XSweet, a toolkit under development by the Coko Foundation. It offers a set of stylesheets for extraction and refinement of data from MS Office Open XML (.docx) format, producing HTML for editorial workflows.

XSweet developer Wendell Piez offered that parallel in a recent presentation at JATS-Con 2017. The two-day conference centers around Journal Article Tag Suite (JATS), an XML format for marking up and exchanging journal content.

The toolkit offers a new path to document conversion: instead of heading first to a format like JATS, XSweet delivers the document into HTML, the lingua franca of the web. Once the document is in HTML, it can be processed in a web-based workflow, progressively improved using browser tools, and easily exported to other formats from there. What was once a tedious trek becomes a journey where collaborators focus on what matters: editing and determining the details of publishing. Details of Piez’s talk are available as part of the conference proceedings.

XSweet offers “refuge” from the slog of conversion because instead of immediately trying to produce structured JATS from unstructured Docx, it produces a faithful rendering of a Word document’s appearance translated into a vernacular HTML/CSS.

In a 45-minute session titled “HTML First? Testing an alternative approach to producing JATS from arbitrary (unconstrained or “wild”) .docx (WordML) format,” Piez walked the audience through a mini-editorial process: taking a Word .docx file sent by an author and pushing it through XSweet to produce an HTML file. “The few hours it took me to produce BITS from the docx original, that was both faithful and also better for further editing and application, were minimal in comparison to the time we were then able to spend on things that really mattered,” Piez said.

Piez is pleased with how the talk went. “A number of audience members approached me afterwards, many of whom had themselves looked this problem in the face before and were willing to confirm the sense of the problem and approaches to it.”

Introducing Texture: An Open Source WYSIWYG Javascript Editor for JATS

Texture is a WYSIWYG editor app that allows users to turn raw content into structured content, and add as much semantic information as needed for the production of scientific publications. Texture is open source software built on top of Substance, an advanced JavaScript content authoring library. While the Substance library is format agnostic, the Texture editor uses JATS XML as a native exchange format. The Substance library that Texture is built on already supports real-time collaborative authoring, and the easy-to-use WYSIWYG interface would make Texture an attractive alternative to Google Docs. For some editors, the interface could be toggled to more closely resemble a professional XML suite, allowing a user to pop out a raw attribute editor for any given element. Texture-authored documents could then be brought into the journal management system directly, skipping the conversion step, and move straight into a document-centric publishing workflow.

Read full story

HTML First?: Testing an alternative approach to producing JATS from arbitrary (unconstrained or “wild”) .docx (WordML) format

XSweet, a toolkit under development by the Coko Foundation, takes a novel approach to data conversion from .docx (MS Word) data. Instead of trying to produce a correct and full-fledged representation of the source data in a canonical form such as JATS, XSweet attempts a less ambitious task: to produce a faithful rendering of a Word document’s appearance (conceived of as a “typescript”), translated into a vernacular HTML/CSS. It is interesting what comes out from such a process, and what doesn’t. And while the results are barely adequate for reviewing in your browser, they might be “good enough to improve” using other applications.

One such application would produce JATS. Indeed it might be easier to produce clean, descriptive JATS or BITS from such HTML, than to wrestle into shape whatever nominal JATS came back from a conversion processor that aimed to do more. This idea is tested with a real-world example.
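The “appearance first” idea described above can be illustrated with a small sketch. This is not XSweet itself (XSweet is a set of stylesheets, and its internals are not shown here); it is a hypothetical Python stand-in that assumes the main document part of a .docx archive sits at `word/document.xml`, handles only top-level paragraphs, and carries over just two run-level properties (bold and italic) as inline CSS. The function names `wordml_to_html` and `docx_to_html` are invented for this illustration.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as w:p, w:r and w:t.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _is_on(run_props, tag):
    """True if a toggle property (w:b, w:i) is present and not switched off."""
    if run_props is None:
        return False
    el = run_props.find(f"w:{tag}", NS)
    if el is None:
        return False
    return el.get(f"{{{W}}}val", "true") not in ("0", "false")


def wordml_to_html(document_xml):
    """Render each top-level w:p as <p>, keeping bold/italic as inline CSS.

    No attempt is made to infer structure (headings, lists, sections):
    the output records what the typescript looked like, nothing more.
    """
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iterfind("w:body/w:p", NS):
        pieces = []
        for r in p.iterfind("w:r", NS):
            text = "".join(t.text or "" for t in r.iterfind("w:t", NS))
            if not text:
                continue
            props = r.find("w:rPr", NS)
            css = []
            if _is_on(props, "b"):
                css.append("font-weight: bold")
            if _is_on(props, "i"):
                css.append("font-style: italic")
            escaped = html.escape(text)
            if css:
                style = "; ".join(css)
                pieces.append(f'<span style="{style}">{escaped}</span>')
            else:
                pieces.append(escaped)
        paragraphs.append("<p>" + "".join(pieces) + "</p>")
    return "\n".join(paragraphs)


def docx_to_html(path):
    """Read the (assumed) main part from a .docx ZIP archive and render it."""
    with zipfile.ZipFile(path) as archive:
        return wordml_to_html(archive.read("word/document.xml"))
```

A later step, separate from this one, could then map the resulting HTML to JATS or BITS; keeping the two steps apart is the point of the approach.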

Read full story

Beware of the laughing horse: Managing a back-catalogue conversion

The International Organization for Standardization (ISO) is an independent, member-based non-governmental organization whose membership comprises 161 national standards bodies. ISO brings together experts to develop voluntary consensus-based International Standards. These are then disseminated through the ISO national membership. In 2011, ISO embarked on its XML journey, with the following aims:

  1. Create a central repository of standards
  2. Improve speed to market, including for national adoptions
  3. Broaden readership
  4. Reduce or avoid duplication of costs
  5. Streamline ISO production processes

The base DTD chosen was JATS, and customizations were made to capture standards-type metadata and content. The customized DTD became known as ISOSTS (the ISO Standards Tag Set). The first acid test of the DTD was to create the central repository with content in a common form, i.e., to convert ISO’s legacy content from Word/cPDF/scanned PDF to ISOSTS-compliant XML. How ISO went about this task is the subject of this paper.

Read full story

Adoption without Disruption: NCBI’s Experience in Switching to BITS

The NCBI Bookshelf at the National Library of Medicine is an online archive of books and documents in life science and healthcare. Its growing collection comprises over 5,000 titles, the majority of which are stored as full-text XML. In the fall of 2014, Bookshelf began work to adopt the Book Interchange Tag Suite (BITS) DTD, replacing the NCBI Book Tag Set Version 2.3 as its XML format of choice. It became immediately apparent that Bookshelf could not simply perform a one-time “switch” to BITS. It needed to support the new schema alongside the old one. The complexity of the project would have required Bookshelf to focus so much energy on the transition to BITS that regular production workflows would have come to a complete halt. This was out of the question, particularly as Bookshelf judged the benefits of adopting BITS to be mostly long-term rather than immediate. Released only in December 2013, BITS was still very new. While there was no doubt that the format was superior to the NCBI Book DTD, the prospect of further revisions to the Tag Suite cautioned against acting too quickly. Therefore, adoption of BITS was conceived as a longer-term project of small, incremental steps designed to neither disrupt the regular production cycle nor consume all resources. By the time version 2.0 of BITS was released in December 2015, Bookshelf had the ability to load, render, and index books tagged in BITS. A number of in-house XML converters were updated to output BITS, and the first titles in BITS were released. While the majority of new content was still tagged according to the NCBI Book Tag Set v2.3, Bookshelf now had a solid foundation to complete adoption using the new version of BITS. By the end of 2016, all workflows had switched to BITS, including Bookshelf’s Word authoring program, external vendors providing BITS XML, and over 20 in-house XML converters.

This paper describes Bookshelf’s experience in adopting BITS: the challenges Bookshelf faced, the solutions it developed, and the lessons learned along the way. Special emphasis is placed on issues related to markup and XML conversion.

Read full story

In pursuit of family harmony: Introducing the JATS Compatibility Meta Model

JATS is an Open Standard. Users may modify it by adding or removing elements and attributes to suit their needs. Some publishers have extended (added to) JATS based on their own requirements. And there are some public extensions like BITS, STS, and TaxPub. Users expect significant efficiencies from vocabularies based on JATS, including the ability to intermingle the documents in databases, to use tools created for JATS for their new vocabulary with minimal additional work, and to adopt rendering/formatting applications and change only those aspects specific to the new vocabulary. Some model changes create compatible documents, which can interoperate with JATS documents gracefully. But some model changes are disruptive. We discuss what types of changes to the JATS models can be integrated into existing XML environments and which may be disruptive. We propose a set of criteria to evaluate whether a proposed change will be seamless or might cause problems.

Read full story

Circling in on the JATS Compatibility Meta-Model

The JATS Meta-Model was developed to guide people who want to customize JATS to meet local needs and have their JATS-based vocabularies work gracefully with existing JATS-based infrastructure. From analyzing content models to defining “social behaviors” of XML elements, the process of defining the JATS Compatibility Meta-Model was rarely straightforward and very often led us to surprising conclusions. Why, for instance, is whether or not something is metadata not a defining property of compatibility? This paper aims to explain the process and thinking behind the model—how we came to the conclusions about compatibility and what we even mean by compatibility. We’ll look at some of the assertions we started absolutely knowing to be important, and discuss why they’re ultimately not in the Meta-Model. By examining the process behind the model and sharing our successes and failures, we hope to improve understanding of the model and its broader implications.

Read full story

PubMed: Redesigning citation data management

Over the last couple of years, we have drastically changed the systems and processes used to manage PubMed citation data. It began with revising long-standing NLM policies and reducing reliance on manual citation corrections, then culminated with the release of the PubMed Data Management (PMDM) system in October 2016. With PMDM, we introduced a single system for managing citation data, with a UI for editing it. In this brave new world, the responsibility for correcting citation data shifted from NLM Data Review to PubMed data providers. Any errors reported in PubMed citations are now forwarded to the publisher, a strategy that publishers have enthusiastically upheld. Here, we outline how the systems and processes for managing PubMed citation data have changed, and detail the outcome of these changes since PMDM was launched.

Read full story

JATS Subset and Schematron: Achieving the Right Balance

Ensuring that published content adheres to the publisher’s business and style rules requires the implementation of quality-control solutions that encompass the entire enterprise, including vendors and in-house staff. The solutions must span the entire life cycle of the manuscript, from XML conversion to production to post-publication enhancements. Two techniques that may help in achieving this goal are 1) developing Schematron and 2) making a JATS subset. Both come with costs: Schematron change management requires development and maintenance of an extensive testbase; making a subset requires comprehensive content analysis and the knowledge of the publishing program’s direction. Achieving the right balance between the two techniques may reduce the costs associated with them.

In this paper, we revisit the notion of “appropriate layer validation” at the current state of technology. We share the experience of running a successful large-scale quality-control operation that has been accomplished by using a combination of JATS subset and Schematron. After demonstrating what Schematron change management entails, analyzing the advantages and costs associated with building Schematron and with creating a subset, and considering several validation scenarios, we conclude with the suggestion that the two techniques, when used in tandem, may complement one another and help control software development costs.
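The division of labor between a subset and Schematron can be sketched in a few lines. The following is a hypothetical Python stand-in, not the authors’ setup: a real deployment would express the first layer as a DTD or schema subset and the second as Schematron assertions. The allowed-element list and both business rules are invented for illustration; only the element names themselves come from JATS.

```python
import xml.etree.ElementTree as ET

# Layer 1 (assumed subset): the only elements this publisher allows.
ALLOWED = {
    "article", "front", "article-meta", "title-group", "article-title",
    "body", "sec", "title", "p", "fig", "label", "caption",
}


def check_subset(root):
    """Structural layer: report element names that fall outside the subset."""
    return sorted({el.tag for el in root.iter() if el.tag not in ALLOWED})


def check_rules(root):
    """Business-rule layer: checks a grammar cannot easily express."""
    messages = []
    # Hypothetical house rule: every figure carries a label.
    for fig in root.iter("fig"):
        if fig.find("label") is None:
            messages.append("fig without label")
    # Hypothetical house rule: the article title must not be empty.
    if not (root.findtext(".//article-title") or "").strip():
        messages.append("article-title is missing or empty")
    return messages


def validate(xml_text):
    """Run both layers and keep their findings separate."""
    root = ET.fromstring(xml_text)
    return {"subset": check_subset(root), "rules": check_rules(root)}
```

Anything the subset rules out never needs a business rule written, tested, and maintained for it, which is one way the two techniques can keep each other’s costs down.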

Read full story

Implementation of JATS at Taylor & Francis

Taylor & Francis has a long history of using XML to publish journal content online, which includes the development of a DTD for publishing journal articles and implementation of the JATS DTD. Currently, in 2017, work is being done to upgrade to JATS version 1.1. This paper describes Taylor & Francis’ proprietary DTD, processes for XML quality control, strategies for transitioning from a proprietary DTD to the JATS DTD, and customizations made to the JATS DTD.

Read full article
