Getting Started with Coko

If you would like to get involved with the code that we produce, then you might find the following information helpful.

First, we have our own chatroom. It runs on the wonderful Mattermost platform (open source). You can find our version running here:

https://mattermost.coko.foundation

The account creation page is linked from there, or you can jump directly to it from here:

https://mattermost.coko.foundation/signup_email

The main room ‘Townsquare’ is where the general chitter chatter takes place. Feel free to jump in and introduce yourself. We are a pretty friendly bunch so don’t be scared to yell out with any issues or help you might need or ideas you might have.

Next, you may wish to have a look around our code. We have a few places for you to check out, depending on your interest. As with our chat room, we host our code on an open source platform – GitLab (GitHub is closed source). You can find our GitLab here:

https://gitlab.coko.foundation

You can create a new account from the Sign In link, or access it directly here:

https://gitlab.coko.foundation/users/sign_in

You don’t need an account to access any of the code, but you do need one to make any merge requests (the same thing as a pull request on GitHub).

As for the code…we have much for you to look at!

Editoria – this is the book production platform you may have heard about. Written in JavaScript on top of PubSweet and using the Wax editor. You can find it here:

https://gitlab.coko.foundation/editoria/editoria

xpub – our very new Journal platform. Written in JavaScript on top of PubSweet and using the Wax editor. It can be found here:

https://gitlab.coko.foundation/alf/xpub-demo

Wax – this is the editor (a web based Word Processor really) we build on top of the substance libs. Written in JavaScript. We are doing a lot of work on this at the moment. It is of use as a standalone app, but also good wrapped up in PubSweet to make your own bespoke platform. You can find Wax here:

https://gitlab.coko.foundation/editoria/wax-pubsweet

INK – INK is our framework for managing file conversion, entity extraction, content enrichment etc etc. It consists of an API (written in Rails) and a client (written in JS, the client is generally used for admin purposes). You can find the api and the client here:

https://gitlab.coko.foundation/INK/ink-api

https://gitlab.coko.foundation/INK/ink-client

INK steps can be found here:

https://gitlab.coko.foundation/INK

XSweet – XSweet is our set of file conversion scripts for MS Word to HTML. Written in XSLT. You can find them here:

https://gitlab.coko.foundation/wendell/XSweet

PubSweet – and lastly, our decoupled CMS, the app that enables us to build platforms and reuse all these juicy components, is to be found here:

https://gitlab.coko.foundation/pubsweet/pubsweet

As you can see, we have a lot going on! Many products in play. If you would like to learn more please jump into Mattermost and say hi! We welcome code contribs, ideas to improve the technologies, questions about what we are trying to do – or anything else you have to say!

INK – the file conversion engine

For the past 8 months we have been building INK – the open source file conversion and transformation engine for publishing.

INK is now nearing 1.0, which should be ready in the next few weeks. In anticipation of the first major release we thought you might like to know a little more about what INK does and why.

INK has been built with two major use cases in mind:

  1. Publishers – publishers need to automate all manner of operations on files (conversion, enrichment, format validation, etc). INK does all this and can be integrated with any current technology stack the publisher uses.
  2. File conversion pros and production staff – the people who love staying up all night perfecting file transformations. INK is a job management framework into which you can plug any action you want taken on files, create recipes, generate reports and more.

Let’s look at these needs a little closer.

INK and Publishers

Publishers need to do all sorts of things to files. The highest value need right now is to automate file conversion from one format to another. Most publishers currently ‘automate’ file conversion by sending MS Word documents to external vendors, which is both costly and slow. Adding to these inefficiencies, errors introduced by the file conversion vendor are painful, as is the workflow required to correct them.

We built INK so that Publishers could automate these conversions and generate reports to measure accuracy and speed. INK supports the Publisher’s workflow by acting as an ‘invisible’ file conversion service. In these situations you push a button and get a result. INK can be integrated into your current workflow with minimal hassle since it uses APIs. Because INK is open source, Publishers can either set up their own instance of INK, or they can use INK as offered by a service for a small fee (we are currently talking to some service providers to make this kind of hosted version available). It could also be possible for several smaller publishers to set up a shared instance of INK to lower costs even further.

As mentioned above, integration with existing software is easy. We have, for example, integrated INK with the open access monograph production platform – Editoria. The integration comes in the form of a button that says ‘Upload Word’. Uploading a Word doc in this instance will send the document to INK and return beautifully formatted and structured HTML to Editoria and ‘automagically’ load it into a chapter. All done without the user knowing a thing about file conversion.

In other contexts you may also require production tools to QA conversions. In this case it is very simple to set up a tightly integrated production environment connecting INK to, for example, a QA editing environment. Everything you need to make your production staff happy (see below for how INK helps troubleshoot file conversions).

INK and File Conversion Pros / Production Staff

It is a simple truth that you cannot have good file conversions without some file conversion pro, somewhere, doing the initial hard work for you. This is because file conversion is not just a science, it is an undocumented art!

INK helps these talented artists help you in 3 critical ways:

  1. Easy to build conversion pipelines – INK enables production staff to construct file conversion pipelines through a simple UI. This means they can assemble a new pipeline, reusing previously constructed conversion steps, in (literally) a matter of minutes. This flexibility hasn’t yet been available in the publishing industry. Most file conversion pipelines are hard coded, which makes them very difficult to optimize and also makes it very difficult to reuse any part of the pipeline for other conversions.
  2. Reusable steps – INK’s pipelines are built up of discrete reusable steps. This is the magic behind INK’s philosophy for reuse. File conversion specialists can build these steps very easily (we have clear example documentation) and then use these steps in as many pipelines as they wish. Steps can be wholly new code in any language, leverage existing services via APIs or run system processes. These steps, once built, can be shared with the world or kept private. Our hope is to build up a shared repository of reusable steps for every need that a publisher may have. This would assist us all by reducing the possibility of duplicating effort, and enable us as a community to spend the time optimizing conversion steps rather than building the same old hard coded conversion pipelines over and over again.
  3. Troubleshooting conversions – INK has a very sophisticated way of managing file conversions and exposing the pipeline results through a clean open API. INK also logs and displays errors to assist in troubleshooting. That means file conversion specialists or production staff can inspect any given conversion and work out exactly where a problem may have occurred and why.

Conversions

Currently we have developed INK steps to achieve the following:

  • Docx to xHTML (a very sophisticated conversion that we have been working on for over 6 months)
  • HTML to PDF
  • EPUB to InDesign XML (ICML)
  • Docx to PDF
  • HTML to print-ready, journal-grade PDF

In the works are the following:

  • Docx to JATS
  • LaTeX to PDF
  • HTML to JATS
  • R Markdown to Docx
  • Markdown to HTML
  • HTML to Markdown
  • EPUB to print ready book formatted PDF
  • HTML to DITA XML
  • EPUB to Mobi
  • Docx to DocBook XML

and more! INK itself, and all steps we produce, are open source (MIT license).

It’s not all about conversions

INK isn’t only about conversions. Reusable steps can be written to mine data from articles, automatically register DOIs, automate plagiarism checks, normalize data, validate formats and data, link identifiers, syndicate, and a whole lot more. One of the most important use-cases ahead of us, we think, is to start parsing and normalizing metadata out of manuscripts at submission time and then disseminating to third parties – reducing the time and effort for processing research and improving early discovery of preprints or articles. A perfect job for INK. We will be moving quickly on to these use cases after our initial file conversions are in place. You should see rapid progress on these other file operations within the next month or so!

Features

There is a lot to the INK universe as it is a sophisticated piece of software. Here is a short breakdown for the technically minded:

INK (API SERVICE)

  • HTTP Service API
  • Resource management
  • Async request management
  • Multi-tenant service architecture
  • JWT authentication
  • Step abstraction (leveraging Ruby gems)
  • Recipe management
  • Web Socket support
  • Event subscription during recipe execution, meaning any client using the INK API can update their users on the progress of execution in real time.

INK (DEMO CLIENT)

  • Login
  • UI Recipe creation (including selecting the steps from an automatically populated, searchable dropdown of available steps on that INK instance)
  • Public and private recipes
  • Editing a recipe from the UI
  • An updated recipe view with clearer step names, and with descriptions
  • Users can immediately see the file list belonging to each step as it completes.
  • Users can download each file individually or together as a .zip file.
  • Administrators can get a status report of services INK uses, so it’s easy to spot potential issues that may affect users.
  • A list of user accounts – it’s basic at the moment, and will evolve to account management.
  • A list of available steps. In the future, administrators will be able to enable and disable execution of these steps from this panel.

As you can see, INK has come a long way from a proof-of-concept and we’re excited about what it can bring to the domain.

We are currently working on the following features:

  • downloadable log and report generation
  • single step execution (currently steps are nested in recipes)
  • synchronous execution
  • http recipe parameters
  • http step parameters
  • semantic tagging of outputs

Please get in touch if you’re interested in finding out more or working with us to improve INK, implement it, or build and share steps! INK 1.0 due by the end of June!

How we’re building the ‘mountain chalet’ of complex conversions

When scaling great heights, sometimes you need a place to rest before moving on.

That’s one analogy for XSweet, a toolkit under development by the Coko Foundation. It offers a set of stylesheets for extraction and refinement of data from MS Office Open XML (.docx) format, producing HTML for editorial workflows.

XSweet developer Wendell Piez offered that parallel in a recent presentation at JATS-Con 2017. The two-day conference centers around Journal Article Tag Suite (JATS), an XML format for marking up and exchanging journal content.

The toolkit offers a new path to document conversion — instead of heading first to a format like JATS, XSweet delivers the document into HTML, the lingua franca of the web. Once the document is in HTML, it can be processed in a web-based workflow, progressively improved using browser tools and easily go out to other formats from there. What was once a tedious trek becomes a journey where collaborators focus on what matters — editing and determining the details of publishing. Details of his talk are available as part of the conference proceedings.

XSweet offers “refuge” from the slog of conversion because instead of immediately trying to produce structured JATS from unstructured Docx, it produces a faithful rendering of a Word document’s appearance translated into a vernacular HTML/CSS.

In a 45-minute session titled “HTML First? Testing an alternative approach to producing JATS from arbitrary (unconstrained or “wild”) .docx (WordML) format,” Piez walked the audience through a mini-editorial process: taking a Word docx file sent by an author and pushing it through XSweet to produce an HTML file.  “The few hours it took me to produce BITS from the docx original, that was both faithful and also better for further editing and application, were minimal in comparison to the time we were then able to spend on things that really mattered,” Piez said.

Piez is pleased about how the talk went.  “A number of audience members approached me afterwards, many of whom had themselves looked this problem in the face before and were willing to confirm the sense of the problem and approaches to it.”

Sowing the seeds for change in scholarly publishing

The promise of open science to improve the speed, transparency and completeness of research sharing has attracted a lot of innovators and developers creating new, open source technology solutions. All too often, though, technologies are built by organizations that see themselves as competitive with one another and work at cross purposes.

We’re focusing on changing this culture. That may seem a strange statement from a Foundation whose initial work has already launched open infrastructure projects such as PubSweet and INK, but bear with us. Coko is working to seed a new ecosystem of open source projects, tools and platforms that work together.

We envision building an evolving network of modular, interoperable, flexible and reusable open source projects that facilitate rapid, transparent and reproducible research and research communication for the public good. Rather than remaining independent and siloed, these projects will share resources and learn from each other, creating an open science infrastructure. Coko is striving to create a healthy ecosystem of projects that can thrive and work with each other to solve the many problems and opportunities that face STEM publishing today.

Our first small step in this direction — which we see as a giant leap — is pulling together complementary projects to create an Open Source Alliance for Open Science. This federation will actively work together to form the  ecosystem, agreeing on best practices that emphasize generosity and openness. The idea is to create a common pool of resources whose development is driven by community needs. Code is shared, so are tips for funding applications, report writing and outreach (etc).

An apt analogy is a community garden: plants that grow well together in common soil are seeded, grown, harvested, shared and plowed back into the land. Individual “plots” may be tended by the gardeners who are most adept at cultivating the seedlings, yet cross-pollination and resource sharing where appropriate are encouraged. The gardeners work in a common space, find territorial solutions and share fruits of the “harvest.”

One example of how we are prototyping this process is the Substance Consortium, which we helped found along with the Public Knowledge Project (PKP), SciELO and Érudit in 2016. Consortium members all use (or intend to use) the open source Texture editor, which helps publishers improve structured documents without having to mess with the underlying XML (Extensible Markup Language) markup. The Consortium started as a way to recognize that organizations using the tools as critical infrastructure have a responsibility to contribute to their upkeep. To that end, Coko has played a foundational role in establishing the consortium, as well as putting in energy and funds that contribute to the sustainability of Substance and the codebase.

As another example, we introduced the innovative new project, Stencila, to funders, and then stepped aside. Typically, in a competitive environment, smaller projects that are desperate for initial funds may be co-opted by larger ones who overshadow the smaller organization and take a large cut of the funding. The larger project may vacuum up the credit without adhering to attribution best practices. Instead, in a demonstration of good faith, we coached Stenci.la through the funding process and made the direct introductions to funders. We then stepped aside to enable Stenci.la to operate as they need to, with the funds they need, and to receive the recognition they duly deserve.

Our efforts to cultivate these projects differ from the typical competitive model where organizations see what others are doing, then throw shade on the newcomers by claiming to be building the exact same thing. This land grab results in whoever has the superior budget, PR and grant-writing staff, and stronger name recognition “winning,” whether or not they intend to actually create the product, build it well, or share it in a meaningful way. This highly competitive landscape discourages healthy open source communities from forming around projects and discourages meaningful, productive, inter-project collaboration.

The garden model will give smaller projects a chance to thrive and grow so as to avoid being co-opted or plowed under. This will create a more diverse and rich ecosystem, since many of these projects arise out of specific expertise that larger projects may not have.

To lay the groundwork for this Alliance, we’re planning a meeting May 1 in Portland, along with founding partners DAT, the Code for Science & Society (CSS) and The California Digital Library (CDL). By meeting in person, discussing initiatives and directly collaborating, we seek to generate buy-in on shared goals and open direct lines of communication between organizations. The initial meeting will garner support for shared goals and values and establish a self-sustaining community with firm attendee commitments to continue the conversation. If you’d like to participate, email us at info@coko.foundation.

All About INK (explained with cake)

Charlie Ablett, INK Lead Developer

INK is Coko’s ingestion, conversion and syndication environment that converts content and data from one format to another, tags with identifiers and normalizes metadata.

When an author or group of authors creates content, there is a fair bit of processing that needs to be done on the content in order to prepare it for publishing.

Typical use cases include converting Word and other proprietary formats into highly structured formats such as HTML5, XML, and ePub, and outputting to syndicated services, the web and PDF. Additionally INK can add common identifiers such as DOIs and geolocation IDs and ensure compliance with standards for content and metadata.

Frameworks similar to INK have been created and re-created in both open and proprietary domains, but INK takes it further and does it better. One of the big advantages of INK is that it is an open source framework for chaining custom processing steps together to automate some of these processes. We encourage (but do not require!) the creation and sharing of steps and recipes – ordered collections of processing steps – so communities, organisations and individuals can help each other. It’s all about sharing and collaboration, which is pretty much what Coko is about.

In this post, I detail how INK works, using cake as an analogy. Don’t worry, if you’re a pie person, you can still follow along as you dream of the perfect raspberry chiffon…

What does INK do?

What a great question – glad you asked.

INK is an open-ended, extensible, modular service that allows processing of files (e.g. documents) via execution of Steps. A user feeds in one or more files, usually a document, the step/s do something with the file/s in sequence, and the user gets the result. It sounds very general, and admittedly a bit abstract, because INK is meant to be flexible and customisable by anyone. Let’s break it down a bit.

Steps

Each Step contains a bit of logic that can do something to one or more files. For example:

  • convert from one format to another, such as converting an HTML document to PDF
  • clean up HTML
  • modify the images in a document (resize them, make them greyscale…)
  • translate a document to another language
  • analyse the contents of a document and generate a summary

This is just a small number of examples. Steps are intentionally open-ended.
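
To make the idea of a step a little more concrete, here is a minimal sketch in Python. INK's real steps are written as Ruby gems, so nothing below is INK's actual API: the `Step` interface, the `perform` method and the `WordCountSummary` class are all hypothetical names chosen for illustration. It shows one bit of logic that acts on the files in a directory, in the spirit of the "generate a summary" example above.

```python
from pathlib import Path
from typing import Protocol


class Step(Protocol):
    """Hypothetical interface: a step acts on the files in one directory."""

    def perform(self, working_dir: Path) -> None: ...


class WordCountSummary:
    """Example step: writes a small summary file for every .txt file it finds."""

    def perform(self, working_dir: Path) -> None:
        # sorted() lists the inputs up front, before any new files are written
        for source in sorted(working_dir.glob("*.txt")):
            words = source.read_text(encoding="utf-8").split()
            summary = working_dir / f"{source.stem}.summary"
            summary.write_text(f"{source.name}: {len(words)} words\n", encoding="utf-8")
```

Because the step only knows about a directory of files, the same shape would work for conversion, clean-up or image processing.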

INK and its steps are released open source, so anyone can set up their own server and run their own customised INK service. They can install whichever steps satisfy parts of their own publishing process. If there’s something they need to do to a document that’s not covered by an existing step, they can write their own and add it to their instance of INK.

Recipes

Often with publishing toolchains, there are several things that need doing to a raw document before it’s ready to publish. INK lets you chain steps together into a recipe. A user can create an INK recipe which is a pipeline of steps all in a row that need to be executed in sequence.

Execution

Think of a recipe just like you’d think about making a cake. A recipe details how one might turn raw materials (sugar, flour, etc) into a cake — but you don’t have an actual cake until you get your ingredients together, put on your apron and follow each step, one after the other.

As you’ll know, a recipe involves more than throwing everything together!

INK can execute the recipe given some files, and when all the steps are done, the user can see the results that each step passed to the next one. They can see if something went wrong, or check if some intermediate step in the recipe didn’t behave as expected. They may need to tweak the step logic itself, or make sure they provide the right kind/s of file/s.
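
As a rough illustration of a recipe and its execution, the sketch below chains steps in order and keeps the output of every step so that intermediate results can be inspected afterwards. It is an assumption-laden toy, not INK code: the `Recipe` class is hypothetical, and steps are plain text-to-text functions rather than file-based steps.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption for this sketch: a step is simply a function from text to text.
StepFn = Callable[[str], str]


@dataclass
class Recipe:
    name: str
    steps: list[tuple[str, StepFn]] = field(default_factory=list)

    def execute(self, content: str) -> list[tuple[str, str]]:
        """Run every step in order and keep each step's output."""
        results = []
        for step_name, step in self.steps:
            content = step(content)  # the output of one step feeds the next
            results.append((step_name, content))
        return results


recipe = Recipe(
    name="Tidy and title-case",
    steps=[
        ("Strip whitespace", str.strip),
        ("Title case", str.title),
    ],
)

for step_name, output in recipe.execute("  a raspberry chiffon cake  "):
    print(f"{step_name}: {output!r}")
```

Keeping one result per step is what lets a user check whether an intermediate step misbehaved, rather than only seeing the final cake.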

How does INK work?

You might be thinking – ask a technical person to explain how some of their software works… and the answer is usually jargon-riddled and aimed towards other developers as an audience. Fortunately, I’ve been teaching developers and non-developers alike for long enough that I can manage to explain something in language that suits a wide range of audiences. Hopefully the following is clear!

INK has three main parts.

In the Ruby programming language, people can write standalone code libraries that other Ruby programs can use. These are called ‘gems’. INK uses INK step gems to detail what each step does.

A step gem contains one or more steps. An INK server might have any combination of step gems installed on it. If a step gem is installed on the server (by the system administrator), then recipes using steps contained within that step gem can be executed on that server. It’s designed this way so that someone running an INK server has control over what steps users can use.
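
The rule that a recipe can only run where all of its steps are installed boils down to a membership check. The sketch below assumes steps are identified by name; the `missing_steps` function and the step names are illustrative and not taken from INK's codebase.

```python
def missing_steps(recipe_steps: list[str], installed_steps: set[str]) -> list[str]:
    """Return the recipe's steps that this server cannot run, in recipe order."""
    return [name for name in recipe_steps if name not in installed_steps]


# Illustrative names only: what one administrator chose to install on one server.
installed = {"Docx to HTML", "HTML to PDF"}
recipe_steps = ["Docx to HTML", "SHOUTIFIER"]

unavailable = missing_steps(recipe_steps, installed)
if unavailable:
    print("Cannot execute recipe, missing steps:", ", ".join(unavailable))
else:
    print("All steps available")
```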

The recipe engine is a Rails web app that keeps track of users, their recipes, and which steps are in which recipe, and in what order. It also tracks which recipes have been executed by whom, and where to find the resulting file/s for each step in the pipeline. When a user decides to execute a recipe, they provide at least one file, and the recipe engine hands it off to the execution engine.

The execution engine performs the logic in the steps in the order specified by the recipe. The results of each step are provided to the following steps in sequence (more about this in a bit).

In order to use INK, users interact with the client. Since INK is an API (a web-based service that doesn’t have a graphical interface of its own), there are other programs, such as ink-pubsweet or the INK client, that people can use to tell the INK system what to do.

Example: Docx to SHOUTY HTML.

Let’s take an example recipe and see what happens when it is executed.

The user has a recipe called “Docx file to SHOUTY HTML”, which has the following RecipeSteps:

  • Docx to HTML
  • SHOUTIFIER (a silly step that makes every letter CAPS and replaces all periods next to a letter with three exclamation marks!!! Not immediately useful, but makes for a GREAT DEMONSTRATION!!!)

  1. The user asks the system to execute the recipe, and provides a file (let’s call it the totally unoriginal name example.docx)
  2. INK checks that the recipe can be executed.
    – it’s been given at least one input file
    – all the steps the recipe asks for are available. Different installations of INK on separate servers might have different steps available, depending on what step gems the system administrator has put on that server. It’s a bit like kitchens having different equipment in them – for example, a pâtisserie kitchen would have quite different equipment than one for charcuterie. Anyone can spin up their own INK server, so it’s really up to them what step gems will deliver the most value to them or their organisation.
  3. The recipe engine queues the execution and immediately lets the user know that it’s in progress. We use an asynchronous process here, so that the user gets some immediate feedback that the execution is in progress, and they can do other things while INK takes care of processing.
  4. The execution engine takes the recipe execution request off the queue, creates a Process Chain from that recipe, and starts the execution. The execution system is always checking the queue for things for it to process, so normally this is instant. If there are some process chains still going, the execution system might wait until they are done (it depends on the pool size – how many such processes the system administrator has told INK it can do at once).
  5. The execution engine starts at the first step and executes it. It copies the input file/s into the step’s “personal” execution directory and executes whatever logic is in there against some or all of the files. In the example above, the execution engine creates the folders it needs, copies example.docx into the directory for the first step (Docx to HTML), then calls the step logic in Docx To HTML. The latter involves calling the system utility Pandoc on the docx file to convert it into HTML. The resulting HTML is written to the same sandbox directory.

    So the directory for the Docx To HTML process step will contain the original docx file (unless the step logic includes cleanup of unneeded older files, which is ideal but not mandatory) and the resulting HTML output from the Pandoc call. Then the step logic tells the framework that it’s all done, and done successfully (ie. without an error).

    If the user had provided a file that the step wasn’t expecting – e.g. a text file, or an image file – the step raises an error to say “I can’t work with this – I need a docx file please” and signals the execution engine to halt the process chain with an error. There’s no point in continuing this particular recipe if a step spectacularly fails.

  6. The execution engine continues to the next step, and repeats until there are no more steps to execute. Again, it copies the files from the previous step into the personal execution directory of the current step, executes the logic against them, and writes the result into the output directory of the current step. And so on.

    In our example, the execution engine copies the .docx and .html files from the Docx To HTML process step into the personal execution directory of the SHOUTIFIER process step, executes the logic, and will change the .html file so the content is ALL IN CAPS!!!

  7. When the pipeline has come to an end, the execution engine notifies the caller via callback (if they provided one). Callbacks are like leaving your phone number and saying “Here are the ingredients and the recipe. Call me on this number when the cake is done.” Meanwhile, you don’t have to sit by the phone and wait – you go do something else and get notified when it’s all done… and then you get to have cake! (Figurative cake in this case. INK can do a lot of things, but it can’t make literal cake. Sorry.)
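
The queueing, pool size and callback ideas from the walkthrough above can be sketched with Python's standard thread pool. This is only an analogy for how INK behaves: `POOL_SIZE`, `run_chain` and `submit` are hypothetical names, the "process chain" is reduced to upper-casing a string, and a plain Python function stands in for whatever callback mechanism INK really uses.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

POOL_SIZE = 2  # assumed setting: how many process chains may run at once


def run_chain(text: str) -> str:
    """Stand-in for a whole process chain; here it only upper-cases the text."""
    return text.upper()


def submit(
    pool: ThreadPoolExecutor,
    text: str,
    callback: Optional[Callable[[str], None]] = None,
) -> Future:
    """Queue the work and return straight away; call back when it is done."""
    future = pool.submit(run_chain, text)
    if callback is not None:
        # "Call me on this number when the cake is done."
        future.add_done_callback(lambda done: callback(done.result()))
    return future


finished: list[str] = []

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    futures = [submit(pool, word, finished.append) for word in ("flour", "sugar", "eggs")]
    print("queued", len(futures), "executions")  # the caller is free to do other work here

# Leaving the with-block waits for every queued chain to finish.
print(sorted(finished))
```

With three requests and a pool of two, the third waits in the queue until a worker is free, which mirrors the waiting described in step 4.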

If there was some sort of issue during step execution, INK keeps track of any errors raised and logs them.
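
Here is a small sketch of that error handling, again in Python rather than INK's Ruby and with made-up names (`StepError`, `docx_only_step`, `run_chain`). A step that is handed a file it was not expecting raises an error, and the chain logs it, records it and stops.

```python
import logging
from pathlib import Path

logger = logging.getLogger("process_chain")


class StepError(Exception):
    """Raised by a step that cannot work with the files it was given."""


def docx_only_step(files: list[Path]) -> list[Path]:
    """Hypothetical step that insists on .docx input (no conversion is done here)."""
    unexpected = [f.name for f in files if f.suffix.lower() != ".docx"]
    if unexpected:
        raise StepError(f"I need a docx file please, got: {', '.join(unexpected)}")
    return files


def run_chain(steps, files: list[Path]) -> dict:
    """Run steps in order; on the first error, log it and stop the chain."""
    errors = []
    for step in steps:
        try:
            files = step(files)
        except StepError as error:
            logger.error("%s failed: %s", step.__name__, error)
            errors.append(f"{step.__name__}: {error}")
            break  # no point continuing once a step has failed
    return {"succeeded": not errors, "errors": errors}


print(run_chain([docx_only_step], [Path("example.docx")]))
print(run_chain([docx_only_step], [Path("notes.txt")]))
```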

INK makes the result files available for download from any process step owned by you, together as a zip file or individually. You can download the contents of the input files, or the HTML output of Docx To HTML, just to make sure it looked right.
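
To tie the per-step directories and the zip download together, here is a self-contained sketch under several assumptions. The directory naming (`step_1`, `step_2`), the function names and the regular expression are mine, not INK's. The Docx to HTML step is skipped (the real one calls Pandoc), so the sketch starts from a pretend HTML output.

```python
import re
import shutil
import tempfile
import zipfile
from pathlib import Path


def shoutify(text: str) -> str:
    """Upper-case everything and turn any period next to a letter into '!!!'."""
    return re.sub(r"(?<=[^\W\d_])\.|\.(?=[^\W\d_])", "!!!", text.upper())


def shoutifier_step(step_dir: Path) -> None:
    """Rewrite every .html file in the step's own directory in place."""
    for html in sorted(step_dir.glob("*.html")):
        html.write_text(shoutify(html.read_text(encoding="utf-8")), encoding="utf-8")


def run_step(step, position: int, input_dir: Path, root: Path) -> Path:
    """Copy the previous step's files into a fresh directory, then run the step there."""
    step_dir = root / f"step_{position}"
    shutil.copytree(input_dir, step_dir)
    step(step_dir)
    return step_dir


def zip_step(step_dir: Path, archive: Path) -> Path:
    """Bundle one step's files into a single zip for download."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in sorted(step_dir.iterdir()):
            bundle.write(path, arcname=path.name)
    return archive


root = Path(tempfile.mkdtemp())
previous = root / "step_1"  # pretend this is the output of Docx to HTML
previous.mkdir()
(previous / "example.html").write_text("<p>Makes for a great demonstration.</p>", encoding="utf-8")

shouty_dir = run_step(shoutifier_step, 2, previous, root)
print((shouty_dir / "example.html").read_text(encoding="utf-8"))

archive = zip_step(shouty_dir, root / "step_2.zip")
with zipfile.ZipFile(archive) as bundle:
    names = bundle.namelist()
print(names)
```

Note that this naive version shouts at the HTML tags as well as the content, and that the first step's directory is left untouched, so its files could still be downloaded separately.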

Wrapping up

INK provides an extensible step-based pipeline framework to help make great content into a publishable format for distribution. Recipes and steps are totally customisable and can be made by organisations and individuals to suit their own requirements.

What really makes INK awesome is that it can be suited to a wide range of processes. We look forward to hearing what delivers value to your organisation. Give it a try and let us know how you get on.

Open Source (MIT):

https://gitlab.coko.foundation/INK/ink-api

https://gitlab.coko.foundation/INK/ink-client

charlie@coko.foundation

Book on Open Source Product Development Method Released!

At the beginning of 2016, Coko was in search of a development methodology. There wasn’t one that we thought of as a good fit, so we invented our own! Originally called Collaborative Product Development, we renamed it to the Cabbage Tree Method (CTM).

The Cabbage Tree Method (CTM, for short) is a new way to create open source software products. With CTM, the people who will use the software drive its design and development. Consequently, the process ‘bakes in’ cultural change as part of the product design process.

We have used this method now in several contexts, most notably with the University of California Press and California Digital Library to design and build Editoria – a monograph production platform.

Now CTM is tested and documented and the free book is available. Version 0.1 is available here – http://www.cabbagetree.org.

As always, the book and the method are free to use and openly licensed. Please check out CTM, try it out and let other people know!

Post by Adam Hyde
