We have an online archive, currently only available in-house, of all printed issues of the Financial Times newspaper, from the first issue in 1888 through to 2010.
Each page of each issue has been scanned and divided into distinct articles, and each article has been processed with OCR (Optical Character Recognition) to extract the text from the image, all of which is stored in monster XML files. As we consider how to integrate our archive into the main site, as others have done, one significant step will be to improve the OCR, the quality of which varies from random noise to almost perfect transcription – see later blog posts.
Along the way, however, there was an opportunity to create an animation from the main banner images.
At this point, our editorial department leaned forward, captivated by the little history lesson playing out before them. For those of us not quite so up on the finer points of font and layout and inter-departmental office politics, the flow of prices and the longevity of the ads provide some interest.
Along with the data on each archive article, the XML gives us the bounding box for each image on the page, so it is fairly simple to identify and extract the banner from the front page of each issue. To keep the data volume down, we restricted this to just the issue from the 1st of each month, and ignored Saturdays (which have a distinctly different banner).
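The selection rule is easy to express in code. What follows is only a minimal sketch, not our actual script: the function name select_banner_issues is made up for illustration, and it assumes the publication dates of the available issues have already been pulled out of the XML.

```python
from datetime import date

SATURDAY = 5  # date.weekday() counts from Monday = 0, so Saturday is 5


def select_banner_issues(issue_dates):
    """Keep only issues dated the 1st of a month, skipping Saturdays.

    issue_dates is assumed to be an iterable of datetime.date objects,
    one per issue found in the archive (hypothetical input).
    """
    return [
        d for d in sorted(issue_dates)
        if d.day == 1 and d.weekday() != SATURDAY
    ]


if __name__ == "__main__":
    sample = [date(2010, 1, 1), date(2010, 2, 1), date(2010, 2, 2), date(2010, 5, 1)]
    for d in select_banner_issues(sample):
        print(d.isoformat())
```

Under this rule, a month whose 1st falls on a Saturday simply contributes no frame to the animation.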
We used ImageMagick’s convert to crop the banner images (leaving some margin because the original scans were quite erratic in how the page had been placed on the scanner). We used FFmpeg to stitch the banners together into a video (MP4) file, as well as to scale it and add black padding. NB, the cropped image size needed to be an even number of pixels along each side for FFmpeg, and we chose a frame rate of 10 frames per second as a compromise between speed and jitter (as mentioned, the scans were not very consistent).
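To give a flavour of the two steps, here is a rough sketch that builds the command lines without running them. It is not a copy of our script: the margin of 20 pixels, the 1280x720 output size, the file names and the function names are all assumptions chosen for illustration. The convert and FFmpeg options used (-crop, +repage, -framerate, -vf with the scale and pad filters, -pix_fmt) are standard ones.

```python
def crop_geometry(x, y, width, height, margin=20):
    """Return an ImageMagick geometry string (WxH+X+Y) for a banner.

    The bounding box is grown by `margin` pixels on every side (clamped at
    the top-left page edge), then width and height are rounded up to even
    numbers, which FFmpeg needed in our case.
    """
    left = max(x - margin, 0)
    top = max(y - margin, 0)
    w = (x + width + margin) - left
    h = (y + height + margin) - top
    w += w % 2  # round up to an even width
    h += h % 2  # round up to an even height
    return f"{w}x{h}+{left}+{top}"


def convert_command(src, dst, geometry):
    """Argument list for cropping one scanned page down to its banner."""
    return ["convert", src, "-crop", geometry, "+repage", dst]


def ffmpeg_command(pattern, out, fps=10, width=1280, height=720):
    """Argument list for stitching numbered frames into an MP4.

    Each frame is scaled to fit inside width x height, then centred on a
    black canvas of exactly that size.
    """
    vf = (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2:color=black"
    )
    return [
        "ffmpeg", "-framerate", str(fps), "-i", pattern,
        "-vf", vf, "-c:v", "libx264", "-pix_fmt", "yuv420p", out,
    ]


if __name__ == "__main__":
    geometry = crop_geometry(100, 50, 801, 120)
    print(" ".join(convert_command("page.png", "banner_0001.png", geometry)))
    print(" ".join(ffmpeg_command("banner_%04d.png", "banners.mp4")))
```

Passing either list to subprocess.run would execute the command, assuming convert and ffmpeg are installed and on the PATH.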
You can see how convert and FFmpeg were configured and used in this script.
But wait, there’s more.
123 years of “…”
Since we had the location of every in-article word included in the archive XML, it was possible to create animations of every occurrence of a specific word (where OCR had managed to catch it correctly), using the same processing as above.
Here is one such animation, a little homage to Spritz:
Which word is in focus?