Improving Our Video Experience

Part Three: Accessibility


Illustration by Jason Fujikuni

This is the third and final post in a series about the progress and achievements of our video delivery platform. The previous posts are Part One: Our On-Demand Video Platform and Part Two: Our Live Streaming Platform.

Improving the accessibility of the video experience on our web properties was one of our main goals for 2017. In this post, we will talk about how we built closed captions support into our video delivery platform and how we made the control set on our video player more accessible.

Why Closed Captions?

The goal of adding closed captions was to make our videos accessible to more viewers in more scenarios. Some viewers are deaf or hard of hearing and need closed captions to fully understand the content of our videos. Some are used to watching muted, autoplaying videos in social media feeds and want the same experience when browsing our site or using our mobile apps. Some might use closed captions to help improve their language skills. Not providing closed captions for videos essentially locked these and other viewers out of experiencing the rich quality of videos produced by video journalists at The New York Times.

Implementing closed captions also creates some new possibilities for our video experience. Closed captions and subtitles use the same underlying technical implementation for videos on the web — in our case, Web Video Text Tracks (WebVTT) with cross-browser support aided by hls.js — which means we’ve opened the door to adding subtitles in languages other than English in the future. Creating video transcripts based on closed captions also now becomes an easier task, which would improve the searchability of our video library.
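As a concrete illustration, a WebVTT cue pairs a timing line with caption text. The sketch below parses a cue's timing line into start and end times in seconds; it is illustrative only (our players rely on the browser and hls.js for actual WebVTT handling, and it assumes the full HH:MM:SS.mmm timestamp form):

```javascript
// Convert a WebVTT timestamp like "00:01:02.500" to seconds.
// Assumes the hours field is present (WebVTT also allows MM:SS.mmm).
function parseVttTimestamp(ts) {
  const [h, m, s] = ts.split(':');
  return Number(h) * 3600 + Number(m) * 60 + Number(s);
}

// A cue timing line like "00:00:01.000 --> 00:00:04.500"
// yields the cue's start and end times.
function parseCueTiming(line) {
  const [start, end] = line.split(' --> ');
  return { start: parseVttTimestamp(start), end: parseVttTimestamp(end) };
}
```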

Designing an Intuitive Caption-Publishing UI

We had several considerations to keep in mind:

  • As with our encoding integration, we wanted the process to be as straightforward as possible. We wanted to require as few clicks as possible to generate captions, accommodate varying caption-generation speeds, allow copy editing of the captions, and control when the captions are visible to the viewer (the ability to “publish” and “unpublish” captions).
  • The journalists should have the option to generate captions without depending on the encoding process. Though the captions could be generated when encoding the video, they could also be generated at any time following the initial encode without having to re-encode the asset.

In the service we designed, the journalists can generate captions in various parts of the publishing process. They can opt to create captions as they upload and encode a video, or they can generate them independent of any other action if a given video asset already contains previously encoded renditions.

The video team can also choose to generate captions in any one of three ways:

  1. Machine-Generated Captions: As the title suggests, these captions are generated by a computer, which makes them less accurate (usually around 70 percent) and likely to require extensive editing. This option is used almost exclusively for breaking news content that needs to appear on the site as soon as possible.
  2. 2-Hour Captions: This option involves a human transcriber and guarantees a two-hour turnaround during normal business hours. These captions usually require only minimal editing.
  3. Same-Day Captions: This is the slower human-generated option, used mainly for non-urgent content.

If the journalists already have a captions file for the video on their computer, they can upload it and forgo the caption-creation step above.

Once the captions are generated or uploaded, the CMS UI reveals a ‘Review’ button that allows the journalist to review the captions with our external provider. They can change the timing, copy edit the text, and once satisfied, finalize their edits and return to the main UI.

After copyediting, the captions are ready to be published. The journalist presses a button to make the captions live, and they become available to the video player within seconds. If, for some reason, they are unsatisfied with the captions, they can press another button to unpublish the captions and remove them from the player.


Generating and Serving Closed Captions

To add closed captions support into our video publishing workflow, we created a microservice called video-captions-api. Behind the scenes, we use third-party vendors like 3Play Media to generate captions from videos and Amara for editing captions.

We had two major goals in mind when creating video-captions-api and integrating it into our existing video publishing pipeline. First, we wanted to make the microservice agnostic to the newsroom workflow, which is always subject to change, so that it would be more stable and easier to open source. Second, because speech-to-text technologies evolve at a fast pace, we wanted the system to give us flexibility in the vendors we use to generate captions, reducing the possibility of vendor lock-in and leaving room to extend the system.

In the implementation of video-captions-api, we created an abstraction called a provider, which can be either a wrapper around a vendor API client or a module that provides some basic functionality. Each provider performs exactly one operation on captions. For example, the 3Play provider wraps a client for the 3Play API and is responsible for generating captions from a video file. To generate captions from a video file, we create a job with the provider set to 3Play, and when this job is finished, we fetch the generated captions file from the 3Play API and upload it to our GCS bucket.
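The provider abstraction can be sketched roughly as follows. This is an illustrative JavaScript sketch only; the real video-captions-api is a separate service, and all class, method, and field names here are assumptions for illustration:

```javascript
// Sketch of a captions service built around single-purpose providers.
// Each provider performs exactly one operation on captions.
class CaptionsService {
  constructor() {
    this.providers = new Map(); // name -> provider implementation
    this.jobs = new Map();      // job id -> job record
    this.nextId = 1;
  }

  register(name, provider) {
    this.providers.set(name, provider);
  }

  // Create a job bound to one provider; the service itself only
  // tracks whether the job is "in progress" or "done".
  createJob(providerName, input) {
    const id = String(this.nextId++);
    this.jobs.set(id, { id, provider: providerName, input, status: 'in progress' });
    return id;
  }

  // Run the job's provider and mark the job done.
  process(id) {
    const job = this.jobs.get(id);
    const provider = this.providers.get(job.provider);
    job.output = provider.run(job.input);
    job.status = 'done';
    return job;
  }
}
```

A hypothetical "3play" provider registered with this service would implement run as "generate captions from a video file and return the captions file location."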

Similarly, to review and edit an existing captions file, we simply create a job with the provider set to Amara, and when the editing is submitted, we mark this job done. The lifecycle of a captions job for the CMS then consists of several jobs on video-captions-api. From the CMS’s perspective, the status of a captions job can be “generating,” “generated,” “being reviewed,” “reviewed” and potentially more, but from video-captions-api’s perspective, there are only two statuses for a job: “in progress” and “done.”
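The two-level status model described above might look like this in miniature. The function and field names here are assumptions for illustration, not the real API; the point is that the CMS derives its richer statuses from a set of jobs that each only know “in progress” or “done”:

```javascript
// Derive the CMS-facing status from a captions job's underlying
// video-captions-api jobs, each of which is only "in progress" or "done".
function cmsStatus(jobs) {
  const generate = jobs.find((j) => j.kind === 'generate');
  const review = jobs.find((j) => j.kind === 'review');
  if (review) return review.status === 'done' ? 'reviewed' : 'being reviewed';
  if (generate) return generate.status === 'done' ? 'generated' : 'generating';
  return 'unknown';
}
```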

To serve a captions file with a video, we just need to include captions metadata in the M3U8 manifest of that video. On the player side, hls.js parses the manifest and retrieves the captions file. Since we need the ability to publish videos and captions separately, we need to be able to update the M3U8s after videos are already published.

Thanks to our service that generates adaptive bitrate streaming formats on the fly, this becomes simple and natural. Once a captions file is published, that file is uploaded to a GCS bucket. Simultaneously, the cached M3U8 manifest of this video is purged. When the next user makes a request for this video, we generate a new M3U8 manifest — with all of the video renditions previously defined as well as the captions content — on the fly. That manifest then gets cached until the next time we update a video rendition or the closed captions.
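As a sketch of what the on-the-fly manifest update involves: an HLS master playlist advertises captions with an EXT-X-MEDIA entry of type SUBTITLES and tags each variant stream with that subtitle group. The inputs below are hypothetical and the real service's output differs in detail:

```javascript
// Build a minimal HLS master playlist, optionally advertising a
// captions rendition (per the HLS spec's EXT-X-MEDIA / SUBTITLES tags).
function buildMasterManifest(renditions, captionsUri) {
  const lines = ['#EXTM3U'];
  if (captionsUri) {
    lines.push(
      '#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",' +
      `LANGUAGE="en",AUTOSELECT=YES,URI="${captionsUri}"`
    );
  }
  for (const r of renditions) {
    // Each variant stream references the subtitle group when captions exist.
    const subs = captionsUri ? ',SUBTITLES="subs"' : '';
    lines.push(`#EXT-X-STREAM-INF:BANDWIDTH=${r.bandwidth},RESOLUTION=${r.resolution}${subs}`);
    lines.push(r.uri);
  }
  return lines.join('\n');
}
```

Regenerating this string whenever the captions or renditions change is what lets us publish and unpublish captions without re-encoding or republishing the video itself.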

Video Controls A11y

After landing the closed captions project in production, we decided to do an accessibility audit of our video player’s UI using a keyboard and screen reader for navigation. Our goal was for this mode of interaction with our video player to be as smooth as it is using a mouse. We quickly realized that this interaction model was broken. Fortunately, we were able to take a few simple steps towards fixing the player’s keyboard and screen reader UX.

In order to make the controls accessible, we needed to update the video player’s markup with semantic HTML. We replaced <div>s and <span>s that had masqueraded as buttons with actual <button> elements. For slider controls (e.g. the volume slider), <input type="range"> was a perfect fit. We also added ARIA attributes and the HTML title attribute to the markup of our controls to provide more context for screen reader users.

<!-- before -->
<div class="play"></div>

<!-- after -->
<button class="play" title="Play" aria-label="Play"></button>

The big win here was being able to take advantage of the accessibility features that browsers bake into <button>, <input>, and friends, such as focus handling.

<!-- before -->
<span class="volume-slider"></span>

<!-- after -->
<input
  type="range"
  class="volume-slider"
  title="Adjust Volume"
  aria-label="Adjust Volume"
/>
For cases where using semantic elements wasn’t possible, we continued using <div>s but added ARIA attributes and tabindexes to make these elements behave as the semantic versions would.
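For those remaining divs, the attributes and key handling that emulate a native button can be sketched like this (the helper names are hypothetical, not from our actual codebase):

```javascript
// Attributes that make a non-semantic element act like a button
// for keyboard and screen reader users.
function buttonAttributes(label) {
  return {
    role: 'button',       // announced as a button by screen readers
    tabindex: '0',        // included in the keyboard tab order
    'aria-label': label,  // accessible name for the control
  };
}

// Native buttons activate on both Enter and Space, so a div acting
// as a button must respond to the same keys.
function activatesButton(key) {
  return key === 'Enter' || key === ' ';
}
```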

With semantic elements in place, we added keydown event listeners to the control elements to allow them to support keyboard interactions. These event listeners closely mirrored the existing click handlers we already had in place.

For example:

const video = document.querySelector('video');
const playButton = document.querySelector('.play');
// toggle playback with the mouse
playButton.addEventListener('click', (event) => {
  video.paused ? video.play() : video.pause();
});
// mirror the click handler for keyboard interactions
playButton.addEventListener('keydown', (event) => {
  if (event.key === 'Enter' || event.key === ' ') {
    video.paused ? video.play() : video.pause();
  }
});
After some discussion with our designers, we also decided to implement a custom focus ring styling on our control set that would be consistent across all browsers. The CSS to implement this was straightforward:

/* disable the browser default styling */
button, input, textarea, a {
  outline: none;
}

/* apply our custom focus styling on :focus only */
button:focus, input:focus, textarea:focus, a:focus {
  outline: 1px solid #4A90E2;
  outline-offset: 1px;
}
But we had one remaining problem: Mouse clicks also focus these elements, and we only wanted to apply focus styles on keyboard navigation. Our solution was to keep track of the currently focused control element (if any) that had been focused by the keyboard and to prevent focus events from firing on mouse events:

const controls = document.querySelectorAll('.controls button');
controls.forEach((element) => {
  element.addEventListener('mousedown', (event) => {
    // stops the `focus` event from being fired on mouse down events
    // so that the focus ring does not appear on mouse interactions
    event.preventDefault();
  });
});

With these improvements in place, the control set on our video player has taken a big step forward for accessibility.

You can check out an example of our improved video experience with closed captions and accessible controls here.

Next Steps

Looking ahead, we are keeping accessibility in mind while planning future improvements for our video player. Some of the features we’d like to implement going forward include multiple language subtitle support, video transcripts, and improved keyboard and screen reader experience on our audio player and the remaining parts of our video player.

Improving Our Video Experience was originally published in Times Open on Medium.
