In a special guest post Anders Eriksen from the #bord4 editorial development and data journalism team at Norwegian news website Bergens Tidende talks about how they manage large data projects.
Do you really know how you ended up with those results after analyzing data from a public source?
Well, often we did not. This is what we knew:
- We had downloaded some data in Excel format.
- We did some magic cleaning of the data in Excel.
- We did some manual alterations of wrong or wrongly formatted data.
- We sorted, grouped, pivoted, and eureka! We had a story!
Then we got a new and updated batch of the same data. Or the editor wanted to check how we ended up with those numbers, that story.
…And so the problems start to appear.
How could we do the exact same analysis over and over again on different batches of data?
And how could we explain to curious readers and editors exactly how we ended up with those numbers or that graph?
We needed a way to structure our data analysis and make it traceable, reusable and documented. This post will show you how. We will not teach you how to code, but maybe inspire you to learn that in the process.
Making your folders shine: organize and stick to it
Our first step was to organize our files in a common structure across all projects. We needed a folder for the raw data, one for the output files that our analysis produced, and we needed a place for code and documentation.
We created a new folder (“thelaboratory”) with a folder for each year, month and the project itself.
The project folder contains a `data` folder for all the raw data files (xls, csv etc.) and an `output` folder where all files produced by the analysis end up.
The files in the data folder are never modified. This is important. You want to keep the files as they were when you downloaded them, got them by e-mail or FOIA’ed them.
If you need to change them, create a new file (with a script/notebook) with a new name and dump the new file in the output folder.
That way you can always go back to start and redo your analysis. And you always know that you haven’t altered anything manually in the source files. Because we don’t do that, do we?
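The folder layout described above can be sketched in a few lines of Python. The folder names here (`thelaboratory`, `thebigstory`, the year and month) are just examples following the convention in this post:

```python
from pathlib import Path

# Example project path: thelaboratory/<year>/<month>/<project name>
project = Path("thelaboratory") / "2017" / "10" / "thebigstory"

# "data" holds the untouched source files; "output" holds everything
# our scripts produce. We never write into "data".
(project / "data").mkdir(parents=True, exist_ok=True)
(project / "output").mkdir(parents=True, exist_ok=True)
```

Running this at the start of a project (or keeping it as a small helper script) means every project starts from the same, predictable structure.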
Know what’s in them
The only thing we edit in the data folder is the filenames, so we always know where a file came from and when we got it. Often a file you download or get by mail has a name like `report123.xlsx`; we rename it to something descriptive that identifies the source, the contents and the date we received it.
There is one exception: if you get multiple files, like `report1.xls`, `report2.xls` etc., that later need to be parsed together, create a subfolder in the data folder with a name following the standard above and put the files there. That makes it easier to parse all the files later on: they are all in the same folder.
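Keeping all the batches in one subfolder pays off when you combine them. A hedged sketch with pandas, where the subfolder name and the two tiny CSV batches are invented for illustration (real projects would more likely glob `*.xls` files and use `read_excel`):

```python
from pathlib import Path
import pandas as pd

# Hypothetical subfolder following the naming standard described above.
folder = Path("data") / "report-files-2017-10"
folder.mkdir(parents=True, exist_ok=True)

# For illustration only: pretend we received two batches of the same data.
(folder / "report1.csv").write_text("year,count\n2016,10\n")
(folder / "report2.csv").write_text("year,count\n2017,12\n")

# Because all batches live in one subfolder, combining them is
# a single glob plus a concat.
frames = [pd.read_csv(f) for f in sorted(folder.glob("*.csv"))]
combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # → 2
```

When a new batch arrives, you drop it in the same subfolder and rerun the cell; nothing else changes.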
The journalist’s notebook revisited
Once you have gathered all your source data files and created your output folder you are ready to interview the data and find the stories.
Just as we did 10 or 20 years ago, we want a notebook available when doing interviews, so we can jot down important stuff: how things were said; important quotes; important facts.
You could of course keep an analog notebook next to your laptop and write down every single operation you do with the Excel-file you got to find a story. But let’s make it a bit more automatic, traceable, reusable and documented.
Say hello to Jupyter Notebooks
Documenting for the future you
Jupyter is an application that runs in your browser. It makes it easy to code, annotate the code, visualize data, comment and structure your analysis.
It is like a living document with code, comments, analysis and graphs all in the same place.
A Jupyter notebook of your project allows you to:
- Document to yourself and others how you went about finding that Pulitzer-worthy story, so you know how to redo it in a year.
- Automate parts of the data cleaning, reusing code you have created before.
- Repeat the cleaning and analysis in seconds when your source sends you an updated spreadsheet. Without having to do all the 132 steps in Excel manually.
- Be open about how you found those striking numbers for your story.
- Send the complete research to your editor, lawyer, curious reader or the interwebs.
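The "repeat in seconds" point is the core of it. A hedged sketch of what a reusable cleaning cell can look like; the column names and numbers are invented stand-ins for a real raw spreadsheet:

```python
from pathlib import Path
import pandas as pd

# Invented two-row stand-in for a raw file; in a real notebook you
# would read it from the data folder with pd.read_excel instead.
raw = pd.DataFrame({
    "municipality": [" Bergen", "Oslo "],   # stray whitespace, as often delivered
    "population": ["271 949", "666 759"],   # numbers formatted with spaces
})

# The same cleaning steps run identically on every new batch of the data.
clean = raw.copy()
clean["municipality"] = clean["municipality"].str.strip()
clean["population"] = clean["population"].str.replace(" ", "", regex=False).astype(int)

# Cleaned files always go to the output folder; the data folder stays untouched.
Path("output").mkdir(exist_ok=True)
clean.to_csv(Path("output") / "population-clean.csv", index=False)
```

When the source sends an updated spreadsheet, you replace nothing by hand: rerun the notebook and the same steps produce a fresh cleaned file in `output`.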
Jupyter Notebooks are often used by scientists — especially data scientists.
Getting the tools you need
To get started with analysis using our folder structure and Jupyter you need to add some animals to your stack.
First you should get Anaconda. This is called a “data science platform” and is the easiest way to get started with Jupyter. It includes the programming language Python and the data science library Pandas, in addition to the Jupyter Notebook app itself.
Getting all this installed is pretty easy, and there are several guides out there.
We are not going to guide you through your first notebook in this post. If you are not familiar with Jupyter, Python and Pandas you should take your time and follow Ben Welsh’s highly recommended step-by-step guide, “First Python Notebook”.
How we structure our notebooks
Every data scientist and journalist probably has their own way of structuring their notebooks. We will show you our preferred way.
Split and conquer
Putting all your importing, data cleaning and analysis in one single notebook (think of it as a webpage) might get messy. So we often split it in two or more notebooks: cleaning and analysis.
To be able to quickly rerun your scripts in the right order later, it is smart to number them. So you might have these notebooks (`.ipynb` is the extension of Jupyter notebook files):

`/myprojects/2010/10/thebigstory/01 Import and clean.ipynb`
`/myprojects/2010/10/thebigstory/02 Analysis.ipynb`
The important meta information
Don’t mess around later trying to remember where you got the data, from whom and when. Maybe add a sentence about your data too: put the details at the top of your notebook.
And of course add a heading so you know what the notebook is all about.
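One way to do this is a Markdown cell at the very top of the notebook. The fields below are generic placeholders, not a fixed standard:

```markdown
# The big story: what this notebook does

**Data source:** who sent the file, or the URL it was downloaded from
**Received/downloaded:** 2017-10-12
**Contact:** name and e-mail of the person who provided the data
**Notes:** one sentence about what the data contains and any known caveats
```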
Keep it tidy
I know, it is not easy once you dig into the analysis, but try to do this:
- Add sub headlines to structure code bits and make your notebook more readable.
- Add Markdown cells with info about what you find out in the process, important notes on why you chose to do your analysis this or that way.
- Keep the code mostly self-explanatory: use readable variable names, indent correctly and keep lines short.
- Prefer readable code to code that performs milliseconds faster.
- Comment code that you struggled with, and add links to the source of the code bits. (Yes, to stackoverflow.com)
- Comment the visualizations. What do you see in the graph? What can we conclude (or not conclude) from the graph? What does it tell you?
Remember: you might create a PDF or website out of the notebook for people to read later. And most importantly: you should be able to read and understand it yourself two years from now.
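To make the tips above concrete, here is a small cell in that spirit; the names and numbers are invented:

```python
# Readable beats clever: descriptive names and short lines.
population_by_year = {"2016": 271_949, "2017": 273_000}  # invented numbers

# Growth from one year to the next, as a share of the earlier year.
# (If a snippet came from elsewhere, link the source here, e.g. the
# Stack Overflow answer you borrowed it from.)
growth = (population_by_year["2017"] - population_by_year["2016"]) / population_by_year["2016"]

print(round(growth * 100, 2))  # growth in per cent
```

Nothing here is tricky, which is the point: a colleague (or the future you) can skim it and see exactly what was computed and why.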
Track your changes
Jupyter Notebook includes a menu item to save a checkpoint which is a version of your notebook at the time of saving. You can use this to save before you start to mess around with your code in innovative ways.
But as far as we have seen, Jupyter only keeps one checkpoint.
If you (like us) want to keep track of changes and share notebooks among team members, you should check in all the code/notebooks in a version control system like Git. We use GitHub for our code.
The added value of using GitHub is that you can view all your notebooks rendered online, which is nice for sharing and for browsing old notebooks without having to install Jupyter.
In a future post we’ll get more into detail on importing, cleaning and analyzing data in Jupyter with Pandas.
In the meantime you can preview some of our notebooks made for the Data Skup Conference 2017 in Oslo on our GitHub repo. (code in Python, comments in Norwegian…)
And of course: do not forget the great crash course from Ben Welsh at the LA Times on making your first Python/Pandas notebook.