By EDWARD PODOJIL, JOSH ARAK and SHANE MURRAY
Data is critical to decision-making at The New York Times. Every day, teams of analysts pore over fine-grained details of user behavior to understand how our readers are interacting with The Times online.
Digging into that data hasn’t always been simple. Our data and insights team has created a new set of tools that lets analysts query, share and communicate findings from their data more quickly and easily than ever before.
One is a home-grown query scheduling tool that we call BQQS, short for BigQuery Query Scheduler. The other is Chartio, which our analysts use to visualize and share their results.
As a result, more analysts from more teams can derive insights from our user data more easily. At least 30 analysts across three teams now have almost 600 queries running on a regular cadence on BQQS, anywhere from once a month to every five minutes. These queries support more than 200 custom dashboards in Chartio. Both represent substantial improvements over our previous model.
What problems were we trying to solve?
This effort began when we migrated our data warehousing system from Hadoop to Google’s BigQuery. Before we built new tools, we worked with analysts to come up with several core questions we wanted to answer:
- What patterns and processes did the analysts use to do their work?
- Which of those processes could we automate, in order to make their work more hands-off?
- How could we make it easier for our growing list of data-hungry stakeholders to access data directly, without having to go through an analyst?
- How could we make it easy to move between business intelligence products, so that we avoid becoming attached to software that eventually turns into legacy software?
Until the migration to BigQuery, analysts primarily queried data using Hive. Although this allowed them to work in a familiar SQL-like language, it also required them to confront uncomfortable distractions like resource usage and Java errors.
We also realized that much of their work was very ad hoc. Regular monitoring of experiments and analyses was often dropped to make way for new analyses. It was also hard for them to share queries and results. Most queries were stored as .sql files on Google Drive. Attempts to solve this using GitHub never took off because it didn’t fit with analysts’ habits.
The act of automating queries was also unfamiliar to the analysts. Although the switch to BigQuery made queries much faster, analysts still manually initiated queries each morning. We wanted to see if there were ways to help them automate their work.
Query Scheduling with BQQS
Before building a scheduling system in-house, we considered two existing tools: Rundeck and Airflow. Although both of these systems were good for engineers, neither really provided the ideal UI for analysts who, at the end of the day, just wanted to run the same query every night.
Out of this came BQQS: our BigQuery Query Scheduler. BQQS is built on top of a Python Flask stack. The application stores queries, along with their metadata, in a Postgres database. It then uses Redis to enqueue queries when they are due to run. At first it could only run data pulls going forward, but we eventually added backfilling capabilities to make it easier to build larger, historical datasets.
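The post does not show how BQQS works internally, so what follows is only a minimal sketch of how backfilling could look, under stated assumptions: a stored query carries a hypothetical `{run_date}` placeholder, and an in-memory `deque` stands in for the Redis queue. The names `render_query` and `enqueue_backfill`, and the table and column names, are illustrative and not part of BQQS.

```python
from collections import deque
from datetime import date, timedelta


def render_query(template: str, run_date: date) -> str:
    """Fill the hypothetical {run_date} placeholder with an ISO date."""
    return template.format(run_date=run_date.isoformat())


def enqueue_backfill(queue: deque, template: str, start: date, end: date) -> int:
    """Enqueue one rendered query per day from start to end, inclusive.

    Returns the number of jobs added to the queue.
    """
    if end < start:
        raise ValueError("end date must not precede start date")
    days = (end - start).days + 1
    for offset in range(days):
        run_date = start + timedelta(days=offset)
        queue.append((run_date, render_query(template, run_date)))
    return days


# Illustrative table and column names; not from the original post.
TEMPLATE = "SELECT COUNT(*) FROM events WHERE event_date = '{run_date}'"

jobs = deque()  # stand-in for a Redis-backed queue
added = enqueue_backfill(jobs, TEMPLATE, date(2017, 1, 30), date(2017, 2, 2))
```

Treating a backfill as many ordinary daily runs means the same query text can serve both the nightly schedule and the historical rebuild.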
This solution addressed many of our pain points:
- Analysts could now “set it and forget it,” barring errors that came up, effectively removing the middleman.
- The system stored actual analytics work without version control being a barrier. The app stores all query changes so it’s easy to find how and when something changed.
- Queries would no longer be written directly into other business intelligence tools or accidentally deleted on individual analysts’ computers.
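One way to keep every query change without asking analysts to use version control is an append-only table, where each save adds a row instead of overwriting the previous text. The sketch below is an assumption about how that could be done, not a description of the BQQS schema: it uses an in-memory SQLite database in place of Postgres, and the table, column and function names are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with an append-only table of query versions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE query_versions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            query_name TEXT NOT NULL,
            sql_text TEXT NOT NULL,
            author TEXT NOT NULL,
            saved_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_version(conn, query_name, sql_text, author):
    """Insert a new row for every edit; earlier versions are never overwritten."""
    conn.execute(
        "INSERT INTO query_versions (query_name, sql_text, author) VALUES (?, ?, ?)",
        (query_name, sql_text, author),
    )
    conn.commit()


def history(conn, query_name):
    """Return (author, sql_text) for every saved version, oldest first."""
    rows = conn.execute(
        "SELECT author, sql_text FROM query_versions WHERE query_name = ? ORDER BY id",
        (query_name,),
    )
    return rows.fetchall()


conn = open_store()
save_version(conn, "daily_visits", "SELECT COUNT(*) FROM events", "analyst_a")
save_version(conn, "daily_visits", "SELECT COUNT(DISTINCT user_id) FROM events", "analyst_b")
versions = history(conn, "daily_visits")
```

Because rows are only ever added, finding how and when a query changed is a matter of reading its history in order.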
Dashboards with Chartio
Under our old analytics system, “living” dashboards were uncommon. Many required the analyst to update data by hand, were prone to breaking, or required tools like Excel and Tableau to read. They took time to build, and many required workarounds to access the variety of data sources we use.
BigQuery changed a lot of that by allowing us to centralize data into one place. And while we explored several business intelligence tools, Chartio provided the most straightforward way to connect with BigQuery. It also provided a clean, interactive way to build and take down charts and dashboards as necessary.
Chartio also supported team structures, which meant security could be handled effectively. To some degree, we could make sure that users had access to the right data in BigQuery and dashboards in Chartio.
Developing new processes
Along with new tools, we also developed a new set of processes and guidelines for how analysts should use them.
For instance, we established a process to condense each day’s collection of user events — which could be between 10 and 40 gigabytes in size — into smaller sets of aggregations that analysts can use to build dashboards and reports.
Building aggregations represents a significant progression in our analytical data environment, which previously relied too heavily on querying raw data. It allows us to speed queries up and keep costs down.
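To make the idea concrete, here is a small sketch of condensing raw events into a daily aggregation. It is only an illustration under assumptions: the event fields, the sample data and the `aggregate_daily` function are hypothetical, and in practice this step would run as a scheduled query in BigQuery, not in application code.

```python
from collections import Counter

# Hypothetical raw events: (event_date, user_id, event_type).
raw_events = [
    ("2017-02-01", "u1", "page_view"),
    ("2017-02-01", "u1", "page_view"),
    ("2017-02-01", "u2", "page_view"),
    ("2017-02-01", "u2", "share"),
    ("2017-02-02", "u1", "page_view"),
]


def aggregate_daily(events):
    """Condense raw events into one row per (date, event type).

    Each row carries the event count and the number of distinct users.
    """
    counts = Counter()
    users = {}
    for event_date, user_id, event_type in events:
        key = (event_date, event_type)
        counts[key] += 1
        users.setdefault(key, set()).add(user_id)
    return {
        key: {"events": counts[key], "unique_users": len(users[key])}
        for key in sorted(counts)
    }


daily = aggregate_daily(raw_events)
```

Dashboards that read from a table like `daily` scan a handful of rows per day in place of every raw event, which is where the speed and cost savings come from.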
In addition, being able to see our analysts’ queries in one place has allowed our developers to spot opportunities to reduce redundancies and create new features to make their lives easier.
There’s much more work to do. Looking ahead, we’d like to explore:
- How to make it easier to group work together. Many queries end up being the same with slightly different variables and thus a slightly different result. Are there ways to centralize aggregations further, so that we have more common data sets and can better ensure data quality?
- Where it makes sense to design custom dashboard solutions, for specific use cases and audiences. Although Chartio has worked well as a solution for us with a smaller set of end-users, we’ve identified constraints with dashboards that could have 100+ users. This would be an excellent opportunity to identify new data tools and products that require the hands of an engineer.
Shane Murray is the VP of the Data Insights Group. Within that group, Josh Arak is the Director of Optimization and Ed Podojil is Senior Manager of Data Products.
Designing a Faster, Simpler Workflow to Build and Share Analytical Insights was originally published in Times Open on Medium.