Quick and Statistically Useful Validation of Page Performance Tweaks

By Justin Heideman

Improving page performance has been shown to be an important way to keep readers’ attention and improve advertising revenue. Pages on our desktop site can be complex, and we’re always looking for ways to improve their performance. Since 2014, when our desktop site was last rebuilt, there have been big changes in client-side frameworks, with great improvements to performance. Adopting those improvements will take us some time, so we wondered whether, in the shorter term, there are smaller changes we could implement to make www.nytimes.com more performant.

A quirky problem we ran into was how to effectively measure modest performance changes when a page has many assets of variable speed and complexity that can impact its performance. We used the magic of statistics to compensate for the variability and allow us to get usable, comparable measurements of a page’s speed.

In order to make our site faster, we have to figure out what is slow first. We do fairly well with much of the attainable low-hanging page performance fruit: compression, caching, time to first byte, combining assets, using a CDN. Our real bottleneck is the amount and complexity of the JavaScript on our pages.

If you look at a timeline of a typical article page in Chrome’s Developer Tools, you’ll see that there is an uncomfortably long gap between the DOMContentLoaded event and the Load event. Screenshots show that the page’s visual completion roughly correlates with the Load event. The flame chart shows a few scripts that take a fair amount of time, but there isn’t any one easily fixable bottleneck that could be removed to make our site faster. Slow pages are death by a thousand protracted cuts. Some of those cuts are our own doing and some stem from third-party assets. The realities of the publishing and advertising world demand that we include a number of analytics and third-party libraries, each of which imposes a performance cost on our site.

In order to start weighing the impacts and tradeoffs of the logic and libraries we have on our page, we wanted real timing numbers to attach to potential optimizations. For instance, in one experiment, we wanted to know how much time is consumed rendering the ribbon of stories at the top of the story template, and how much faster the template could render if the ribbon were removed.

One way to do this would be to use the User Timing API and measure the time from when the ribbon initializes to when its last method completes. This works when we control the code and can easily modify it. It’s not as easy when we want to weigh the impact of a third-party library, because we can’t attach timing calls to code we don’t control. There is another problem with this approach: instrumenting one module provides an incomplete picture. It doesn’t show the holistic, down-the-timeline impacts that an optimization may have, or account for the time it takes a script to download and parse before it executes.
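As a rough sketch of that first approach, User Timing marks can wrap the module being measured. The “ribbon” mark names below are illustrative, not our actual instrumentation:

```js
// Hypothetical marks around the ribbon module (names are illustrative).
performance.mark('ribbon-start');

// ... the ribbon initializes and renders here ...

performance.mark('ribbon-end');
performance.measure('ribbon', 'ribbon-start', 'ribbon-end');

// Read back the duration, in milliseconds.
const [entry] = performance.getEntriesByName('ribbon');
console.log(`ribbon took ${entry.duration}ms`);
```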

An even more fundamental problem is that any type of performance measurement on a page will give different timing values each time a page is reloaded. This is due to fluctuations in network performance, server load, and tasks a computer is doing in the background, among other factors. Isolating a page’s assets might be one way to solve it, but that is impractical and will give us an inaccurate picture of real-world performance. To correct for these real-world fluctuations and attempt to get usable, comparable numbers, we ran our timing tests multiple times, collected the numbers and plotted them to make sure we had a good distribution of results.

The specific values the graphs show aren’t important, but the shape of the curve is; you want to see a clear peak and drop-off, which indicates that you have enough sample points and that they are distributed in a logical manner. We found the median of the collected timing values to be the most useful comparison number for our tests. The median is typically most useful when a dataset is skewed, like ours, and it is less susceptible to outlying data points.
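As a quick illustration (not code from our tooling), the median of the collected timings is just the middle value of the sorted samples, which is why a single slow outlier barely moves it:

```js
// Sort the samples and take the middle value (or the mean of the two middle values).
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// The 4100ms outlier has almost no effect on the result (1310).
console.log(median([1200, 1350, 1280, 4100, 1310]));
```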

Gathering enough numbers by hand (e.g., reloading Chrome, writing down numbers) would be tedious, though effective. We use some open-source browser automation tools for functional testing of our sites, but they require careful setup, and retrieving page performance numbers from them is not straightforward. Instead, we found and used nightmare, an automatable browser based on Electron and Chrome, and nightmare-har, a plugin that gives access to the HTTP Archive (HAR) for a page, which is full of useful performance information.

Here’s what a simple script looks like to get the load event timing for a page:
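A minimal sketch of such a script, using plain nightmare and the browser’s Navigation Timing API rather than the HAR plugin; the URL and the command-line argument handling are placeholders:

```js
// measure-load.js (hypothetical name): print the time from navigation start
// to the end of the Load event for a single page visit.
const Nightmare = require('nightmare');

const url = process.argv[2] || 'https://www.nytimes.com/';

Nightmare({ show: false })
  .goto(url)
  // Wait until the browser has recorded the end of the Load event.
  .wait(() => window.performance.timing.loadEventEnd > 0)
  .evaluate(() => {
    const t = window.performance.timing;
    return t.loadEventEnd - t.navigationStart; // milliseconds
  })
  .end()
  .then((loadTime) => {
    console.log(loadTime);
  })
  .catch((err) => {
    console.error(err);
    process.exit(1);
  });
```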

We want multiple timing numbers for the results to be useful, so we need to loop the test. Unfortunately, nightmare acts erratically when looped with the HAR plugin, so our solution is to wrap the Node script with a simple loop in a shell script, like so:
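A minimal sketch of that wrapper, assuming the Node script above is saved as measure-load.js and takes a URL argument; the URLs, file names, and run count are placeholders:

```bash
#!/bin/bash
# Interleave control and test runs so both see the same network conditions,
# appending one timing number per run to each output file.
CONTROL_URL="https://www.nytimes.com/control-version"
TEST_URL="https://www.nytimes.com/test-version"

for i in $(seq 1 40); do
  node measure-load.js "$CONTROL_URL" >> control.txt
  node measure-load.js "$TEST_URL" >> test.txt
done
```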

You might notice that in the above script we’re actually testing two URLs. This is due to another quirk of page performance fluctuations. We found that if we ran two tests sequentially (e.g., 40 runs of the control, then 40 runs of our test), our numbers were perplexing and sometimes did not match what we expected from our optimizations. We found that even an hour’s separation in time could produce variations in timing that would obfuscate the performance changes we were trying to see. The solution is to interleave tests of the control page and the test page, so both are exposed to the same fluctuations. By doing this, we were able to see performance deltas that were consistent across network variations.

Put it all together, let it run (go get a cup of coffee), and you’ll get two columns of numbers that are easily pastable into the spreadsheet of your choice. You can then use them to make two histograms of your results, one for the control page and one for the test page.

These are still not real, in-the-wild numbers, but they are easier to attain than setting up a live test on a site, which we’re planning to do in the near future. Quick testing like this gives us the confidence to know that we’re on the right track before we invest time in making more fine-grained optimizations.


Quick and Statistically Useful Validation of Page Performance Tweaks was originally published in Times Open on Medium.
