By JOSH ARAK and KENTARO KAJI
Systematic experimentation — in the form of A/B and multivariate testing — has fast become embedded in the workflow and culture of teams across The New York Times: Product teams test new features; newsroom editors test the framing of individual stories; and marketing tests to learn what it takes to turn casual visitors into subscribers.
Like many organizations, we debate the weight given to instinct versus data-driven decisions and grapple with the best ways to measure long-term success. But we also recognized the need to establish a common language, framework and set of tools for running experiments across The Times.
Two years ago, our landscape of testing technologies was extremely fragmented, with five separate testing platforms and five separate methods of tracking and reporting being used to accomplish the same task. Teams used different processes and methodologies, which made it difficult to report and interpret consistent results.
To address these issues, we convened a team of developers and analysts to research ways to simplify and standardize our testing strategy. It was out of these efforts that our new system, known as ABRA, was born.
ABRA, short for A/B Reporting and Allocation architecture, was developed with two key goals in mind: first, to provide an embedded, lightweight framework that allows for flexible testing on both the front-end and back-end; and second, to tie that framework directly into our data infrastructure to ensure accurate, fast and flexible reporting.
Today, ABRA supports a range of experiments at The Times, including the replatform and redesign of desktop and mobile home screens, as well as experiments in paywall innovation and personalization, to name just a few.
Built, Not Bought
Our work on ABRA began in 2015. After meeting with industry counterparts, reviewing available vendors and determining the vast array of tests we needed to support, one thing was clear: Our most important tests (e.g. home page redesign, meter rule and paywall testing, ad pattern changes) required experiments to live deeper in our infrastructure than vendor solutions would allow. So we decided to build ABRA ourselves.
Before we started coding, we researched which capabilities from vendor products we wanted to preserve, as well as how to embed a custom framework deeply into our technology stacks. That research led to several core components around which we could focus our development efforts:
- User ID Management
- Allocation Framework
- Targeting & Segmentation Capabilities
- Integration & Deployment
- Data Model: Collection & Aggregation
- Reporting Interfaces
The project was then split into two major development initiatives: One focused on the allocation framework, which allows for the creation of experiments and the allocation, management and tracking of test visitors. The other focused on a dedicated data pipeline and tools to support the reporting needs of our users.
How it Works: Allocation Framework
At the heart of the allocation framework is a small JavaScript snippet, which usually lives inline and above the fold. It initializes DOM attributes on the document root based on the user’s randomly assigned variations. Using these attributes, CSS rules and other inline scripts can change the display and styling of content before first paint. The snippet also installs a library with facilities for reporting variation assignments and user actions to our data pipeline.
The variations a user gets aren’t truly random; in fact, we use a consistent hashing algorithm to persist A/B test groups, hashing each experiment ID with the user ID and mapping the result to one of a list of variation IDs. (This common technique is better described in a 2007 paper by Kohavi et al., under “Hash and partition.”)
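The hash-and-partition assignment described above can be sketched in a few lines. This is an illustrative reimplementation, not ABRA's actual code: the hash function, ID formats and delimiter are assumptions, but the structure — hash the experiment ID together with the user ID, then map the digest onto the list of variations — matches the technique the post names.

```python
import hashlib

def assign_variation(experiment_id, user_id, variations):
    """Deterministically map (experiment, user) to one variation.

    Hashing the experiment ID together with the user ID means the same
    user can land in different groups across different experiments,
    while always getting the same group within a single experiment.
    """
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    index = int(digest, 16) % len(variations)
    return variations[index]

# The assignment is stable: repeated calls return the same variation.
print(assign_variation("home-redesign", "user-123", ["control", "treatment"]))
```

Because assignment is a pure function of the two IDs, no per-user allocation state needs to be stored server-side — which is what makes the single fixed-length user ID cookie described below sufficient.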
We also knew some experiments would have to span multiple subdomains of nytimes.com, but we wanted to avoid adding more to the already overburdened nytimes.com cookie. Hashing on the client requires only a single fixed-length user ID cookie, which (unlike other forms of client-side storage) is shared by all subdomains of nytimes.com, and a configuration object that’s identical for everyone and therefore trivially cacheable at the edge.
And in order to avoid interdepartmental gridlock, we wanted to enable the teams responsible for all web products to conduct experiments independently of one another, while allowing teams whose products span multiple properties (our marketing assets, for example) to run tests across all of them at once.
This implementation has spread across The New York Times’s web products, including to stacks that might theoretically benefit from more tailored versions of ABRA — products that operate without an edge cache, for example — but which find the inline snippet good enough. Future iterations may integrate more tightly into the underlying stacks, since there seems to be a move towards consolidating previously disparate products like desktop and mobile core news. The future is bright!
How it Works: Data Pipeline and Reporting UI
The sheer number of data sources being used in our old system made it difficult and slow for analysts to draw consistent, meaningful insights from their experiments. We knew that one of the most impactful improvements that could come out of ABRA was improving data accuracy and giving experiment owners easier access to data and insights.
Our first goal was to develop a unified data pipeline to ensure we were collecting all experiment data in a single place. With this data pipeline we were looking to accomplish a few key objectives:
- Provide a single source of data for all experiments being run across the company
- Collect and store hit-level experiment data that our analysts can use for deeper exploration into experiments
- Develop rollups of our hit-level data and join them to key data sources for a summary of critical engagement and revenue metrics, surfaced via a fast, friendly UI
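The rollup step in the last objective might look something like the following sketch. The hit-level schema here (experiment, variation, user, pageviews, converted) is a simplified assumption for illustration; the real pipeline joins against additional engagement and revenue sources.

```python
from collections import defaultdict

def rollup(hits):
    """Aggregate hit-level experiment rows into per-variation summaries.

    Counts distinct users and sums pageviews and conversions for each
    (experiment, variation) pair -- the kind of summary a reporting UI
    can query quickly instead of scanning raw hits.
    """
    acc = defaultdict(lambda: {"users": set(), "pageviews": 0, "conversions": 0})
    for hit in hits:
        key = (hit["experiment"], hit["variation"])
        acc[key]["users"].add(hit["user"])
        acc[key]["pageviews"] += hit["pageviews"]
        acc[key]["conversions"] += hit["converted"]
    # Replace the user sets with distinct-user counts for the summary table.
    return {
        key: {
            "users": len(v["users"]),
            "pageviews": v["pageviews"],
            "conversions": v["conversions"],
        }
        for key, v in acc.items()
    }

hits = [
    {"experiment": "exp1", "variation": "A", "user": "u1", "pageviews": 3, "converted": 0},
    {"experiment": "exp1", "variation": "A", "user": "u1", "pageviews": 2, "converted": 1},
    {"experiment": "exp1", "variation": "B", "user": "u2", "pageviews": 5, "converted": 1},
]
summary = rollup(hits)
```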
Next, we developed a user interface for end users to access key metrics on their experiments, where we apply a Bayesian methodology for determining lift and confidence. For more complex experiments, the analyst creates a custom report from the ABRA data pipeline to incorporate the necessary metrics and audience segments of interest.
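One common Bayesian approach to the "confidence" number in such a UI — and only a sketch of the general technique, not ABRA's actual methodology — is to model each variation's conversion rate as a Beta posterior and estimate the probability that the treatment beats the control by Monte Carlo sampling. The uniform Beta(1, 1) prior and sample count below are illustrative choices.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=20000, seed=42):
    """Estimate P(rate_B > rate_A) under independent Beta(1, 1) priors.

    With a uniform prior, the posterior for a variation with `conv`
    conversions out of `n` visitors is Beta(1 + conv, 1 + n - conv).
    We draw paired samples from both posteriors and count how often
    the B sample exceeds the A sample.
    """
    rng = random.Random(seed)  # fixed seed for reproducible reports
    wins = 0
    for _ in range(samples):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples

# Example: 100/1000 conversions for control vs. 150/1000 for treatment.
confidence = prob_b_beats_a(100, 1000, 150, 1000)
```

The same samples also yield a lift distribution (sample `rate_b / rate_a - 1` instead of a win indicator), which is what makes this formulation convenient for a reporting UI.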
What’s to Come?
With the expanded use of ABRA across the organization, there are a number of enhancements to come:
- Deeper integration into our native and replatformed web apps
- More flexible and detailed reporting for teams running experiments with ABRA, surfacing the most relevant metrics and audience segments for their goals
- Smarter optimization methods, including contextual bandits, for experiments that are evaluated in real time and optimized perpetually
- Expanded targeting capabilities, for delivering experimental and optimal treatments to behavioral and geographic audience segments
ABRA: An enterprise framework for experimentation at The Times was originally published in Times Open on Medium.