Computation is no longer the preserve of science and engineering, so I thought I would share a simple computational literary analysis that I did with my daughter.
Hannah’s favorite book is Lord of the Flies by William Golding, and as part of a project she was doing, she wanted to find some quantitative information to support her critique.
Spoiler alert: for those who don’t know it, the book tells the story of a group of schoolboys shipwrecked on an island. Written as a reaction to The Coral Island, an optimistic and uplifting book with a similar initial premise, Lord of the Flies instead relates the boys’ descent into savagery once they are separated from societal influence.
The principal data that Hannah asked for was a timeline of the appearance of the characters. This is a pretty straightforward bit of counting. For a given character name, I can search the text for the positions at which it appears, and while I am at it, label the data with Legended so that it looks nicer when plotted.
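As a rough illustration of that counting step, here is a minimal Python sketch (not the Wolfram Language code used for the article). The function name `character_positions` and the sample sentence are made up for the example, and matching on whole words only is my assumption rather than something the article specifies.

```python
import re

def character_positions(text, name):
    """Return the start index of every whole-word occurrence of `name` in `text`.

    Whole-word matching is an assumption made for this sketch; a plain
    substring search would also count names embedded in longer words.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return [match.start() for match in pattern.finditer(text)]

# Made-up sample text, standing in for the full book string.
sample = "Ralph blew the conch. Jack frowned, but Ralph only laughed."
ralph_positions = character_positions(sample, "Ralph")
jack_positions = character_positions(sample, "Jack")
print(ralph_positions, jack_positions)
```

Each returned number is a character offset into the text, which is the raw material for the timeline.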
The variable $lotf contains the text of the book (there is some discussion later about how to get that). By dividing the string position by the length of the book, I am rescaling the range to 0–1 to make alignment with later work easier. Now I simply create a Histogram of the data. I used a SmoothHistogram, as it looks nicer. The smoothing parameter of 0.06 is a subjective choice, but it gave a pleasingly smooth overview without squashing all the details.
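To make the rescaling and smoothing concrete, the sketch below does the same two steps in plain Python: divide each position by the text length, then evaluate a Gaussian kernel density estimate on a grid over 0–1. This is an illustration only, not the article's code; the function names are hypothetical, and treating 0.06 as a Gaussian kernel bandwidth is my reading of the smoothing parameter.

```python
import math

def normalized_positions(positions, text_length):
    """Rescale character offsets to the range 0-1."""
    return [p / text_length for p in positions]

def smooth_density(samples, bandwidth=0.06, grid_size=101):
    """Gaussian kernel density estimate on an even grid over [0, 1].

    Assumption: the smoothing parameter is used as the kernel bandwidth.
    """
    if not samples:
        raise ValueError("need at least one sample")
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]
    return grid, density

# Made-up offsets in a 1000-character text: two early mentions, one late.
samples = normalized_positions([100, 120, 900], 1000)
grid, density = smooth_density(samples)
peak_location = grid[density.index(max(density))]
print(peak_location)
```

A smaller bandwidth would show each mention as its own spike, while a larger one would blur early and late appearances together, which is why the value is a judgment call.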
Already we can see some of the narrative arc of the book. The protagonist, Ralph, makes an early appearance, closely followed by the antagonist, Jack. The nonexistent Beast appears as a minor character early in the book as the boys explore the island before becoming a major feature in the middle of the book, preceding Jack’s rise as Jack exploits fear of the Beast to take power. Ralph becomes significant again toward the end as the conflict between him and Jack reaches its peak.
But most of Hannah’s critique was about meaning, not plot, so we started talking about the tone of the book. To quantify this, we can use a simple machine learning classifier on the sentences and then do basic statistics on the result.
By breaking the text into sentences and then using the built-in sentiment analyzer, we can hunt out the sentence most likely to be a positive one.
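The shape of that step can be sketched in Python as follows. This is not the built-in analyzer the article refers to: the word lists, the naive sentence splitter and the function names are all hypothetical stand-ins, and a raw word-count score takes the place of a real classifier's probability.

```python
import re

# Hypothetical toy word lists; a real analysis would use a trained classifier.
POSITIVE = {"happy", "laughed", "bright", "glad", "delight"}
NEGATIVE = {"afraid", "dark", "cried", "dead", "fear"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentiment_score(sentence):
    """Positive-word count minus negative-word count (a crude stand-in)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def classify(sentence):
    """Map the score to one of the three class labels."""
    score = sentiment_score(sentence)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

# Made-up sample text, standing in for the book.
text = ("The boys were afraid of the dark. "
        "Ralph laughed, glad and happy in the bright sun. "
        "They walked on.")
sentences = split_sentences(text)
labels = [classify(s) for s in sentences]
most_positive = max(sentences, key=sentiment_score)
print(labels, most_positive)
```

With a real classifier, the `max` would be taken over the predicted probability of the positive class rather than over a word count.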
The classifier returns only "Positive", "Negative" and "Neutral" classes, so if we map those to numbers we can take a moving average to produce an average sentiment vector with a window of 500 sentences.
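A minimal Python sketch of the mapping and moving average follows. It is an illustration rather than the article's code, and the specific values +1, 0 and -1 are my assumption, since the text only says the classes are mapped to numbers.

```python
# Assumed numeric mapping; the article does not state the exact values.
SCORES = {"Positive": 1, "Neutral": 0, "Negative": -1}

def moving_average(values, window):
    """Mean of each run of `window` consecutive values.

    Returns len(values) - window + 1 averages, using a running total so
    each step costs constant time.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages

# Made-up labels; the article uses a window of 500 over the whole book.
labels = ["Negative", "Negative", "Positive", "Positive", "Neutral"]
values = [SCORES[label] for label in labels]
trend = moving_average(values, 2)
print(trend)
```

The window size trades detail for stability: a window of 500 sentences averages away individual lines and leaves only the broad drift in tone.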
Putting that together with the character occurrences allows some interesting insights.
We can see that there is an early negative tone as the boys are shipwrecked, which quickly becomes positive as they explore the island and their newfound freedom. The tone becomes more neutral as concerns rise about the Beast, and turns negative as Jack rises to power. There is a brief period of positivity as Ralph returns to prominence before the book dives into bleak territory as everything goes bad (especially for Piggy).
Digital humanities is a growing field, and I think the relative ease with which the Wolfram Language can be applied to text analysis and other kinds of data science should help computation make useful contributions to many fields that were once considered entirely subjective.
Appendix 1: Notes on Data Preparation
Because I wanted to avoid needing to OCR my hard copy of the book, I used a digital copy. However, the data was corrupted by some page headers, artifacts of navigation hyperlinks and an index page. So, here is some rather dirty string pattern work to strip out those extraneous words and numbers to produce a clean string containing only narrative:
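For a sense of what such cleanup can look like, here is a short Python sketch. It is not the pattern code used for the article, and every pattern in it is a hypothetical example of the kinds of debris described (a repeated title header, bare page numbers, bracketed navigation links); the real patterns depend entirely on the particular digital copy.

```python
import re

# All three patterns are hypothetical examples, not taken from the real file.
PAGE_HEADER = re.compile(r"^\s*Lord of the Flies\s*$", re.MULTILINE | re.IGNORECASE)
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$", re.MULTILINE)
NAV_LINK = re.compile(r"\[(?:Back|Next|Contents)[^\]]*\]")

def clean_text(raw):
    """Strip the assumed debris patterns, then collapse leftover whitespace."""
    text = raw
    for pattern in (PAGE_HEADER, PAGE_NUMBER, NAV_LINK):
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up fragment showing each kind of debris once.
raw = ("Lord of the Flies\n12\nThe fair boy stopped.\n"
       "[Back to contents]\nHe looked up.")
cleaned = clean_text(raw)
print(cleaned)
```

Patterns like these are easy to make too greedy, so it is worth spot-checking the cleaned string against the source before running any analysis on it.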
Appendix 2: Text Labels
This is the code for labeling the plot with key moments in the book:
Appendix 3: Conch Shell Word Cloud Image
It doesn’t provide much insight, but the conch word cloud at the top of the article was generated with this code:
Reference (conch shell): Source image from pngimg.com, Creative Commons 4.0 BY-NC.