In a Confusing World, Context is Key — A Times Intern Sets Out to Improve Search Results

Illustration by Kevin Zweerink for The New York Times

The past few years have seen the rise of “context-aware” systems: technologies that can predict your intentions based on information about your environment. If you ask Google’s intelligent personal assistant, “How tall is that building?” it will use your phone’s GPS to see what buildings are near you and guess which building you are asking about. Or, if you add “pick up milk” to the Reminders app on your iPhone, you can choose to have the app remind you the next time you are within a block of a grocery store.

This use of technology is a response to one of the most challenging problems of human-computer interaction: humans are terrible at articulating what they want in ways that computers can process. We are used to communicating with other humans who largely share our context; they can infer what we mean even when what we say is ambiguous. Algorithms that sit between humans and computers have to “guess” the human’s intention, which is the crux of information retrieval (i.e. search) problems.

Predicting search queries based on news events

I took a stab at this problem during my internship at The Times last summer. In formulating my approach, I made an important assumption: because The New York Times is a news platform, it’s likely that when readers search for things on the website, they are looking for articles that are the most topical.

This may not be a fair assumption for all use cases: a high-schooler writing a term paper may be looking for archived articles as primary sources, or a print reader may be looking for the digital version of a story to email to a friend. While these use cases do occur, they are much less frequent, so relevance to current news stories was a compelling objective.

How The New York Times handles search today

Currently, the search feature on nytimes.com uses Elasticsearch, which ranks results on a principle called “TF/IDF” (Term Frequency/Inverse Document Frequency). The basic idea is to determine the significance of each term in an article by looking at how many times the term is repeated, then dividing this by how many documents in The New York Times archive contain the term. That way, common terms like “said” and “Mr.” are weighted less heavily than terms like “Amazon” or “Game of Thrones.” Articles are then ranked based on how significant the user’s query terms are to each story. While this approach may work well for scientific and encyclopedia articles, it’s not the best fit for news stories, which tend not to repeat their key terms very often.
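
To make the idea concrete, here’s a minimal sketch of computing TF/IDF scores with scikit-learn. The toy corpus is invented for illustration; none of this is The Times’s actual search code.

# A minimal TF/IDF sketch using scikit-learn with a toy, made-up corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the senator said the bill would pass",
    "the chef said the restaurant serves farm-to-table dining",
    "the team said the game was close last night",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Terms that appear in every document ("the", "said") get the minimum IDF,
# so distinctive terms count for more than their raw frequency alone.
terms = vectorizer.get_feature_names_out()
for doc_id, row in enumerate(tfidf.toarray()):
    ranked = sorted(zip(terms, row), key=lambda pair: -pair[1])
    print(doc_id, [(term, round(weight, 2)) for term, weight in ranked[:3]])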

Why we need context

Let’s look at this article published on August 1, 2017 in The Times’ Travel section that details how a restaurant at The Mohonk Resort was able to scale farm-to-table dining to hundreds of guests every night. This idea is summarized in the opening paragraph:

Can you create a farm-to-table restaurant if you have 100 tables in an 8,750-square-foot main dining room, 80 more tables downstairs, nine in a cozy lounge, 50 outside overlooking an Arcadian lake and, on certain nights, eight in a capacious kitchen?

Now, let’s look at the seven most frequently used terms in the article.

  1. Mr.
  2. Mohonk
  3. Palmeri
  4. New
  5. York
  6. Dining
  7. Restaurant

The top five terms don’t appear in the first paragraph at all. Based on just term frequencies, the article could just as easily be a restaurant review or a profile of the chef, Jim Palmeri. Word counts don’t tell the whole story, so it’s very difficult to tell how relevant an article is to current events from this information alone.
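
For reference, a term-frequency list like the one above takes only a few lines to produce. This sketch leaves out The Times’s actual tokenization, and the article text is a placeholder.

# A minimal sketch of ranking an article's terms by raw frequency.
import re
from collections import Counter

article_text = "..."  # placeholder for the full text of the Mohonk story

tokens = re.findall(r"[A-Za-z][A-Za-z.'-]*", article_text)
counts = Counter(tokens)
print(counts.most_common(7))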

How topic clustering works

So how can an algorithm determine whether a story is topical or not? This is where we can use topic clustering. With this technique, an algorithm “learns” what the main conversations in the news are at the moment by looking at all of the articles published over a one- or two-week period. From this set of articles, it extracts the major topics, which are defined by a set of terms. Here are a few examples:

Topic 1: Trump, staff, house, secretary, Kelly, director, communications, chief, press, President, white, Scaramucci, Spicer, Priebus

Topic 2: Korea, north, nuclear, Pyongyang, missile, states, united, military, test, China, Korean, south

Topic 3: affordable, Americans, would, people, percent, republicans, obamacare, health, plan, coverage, medicaid, insurers, insurance, care

The above topic definitions are produced by using an unsupervised learning algorithm based on a Latent Dirichlet Allocation (LDA) model. I’m going to give a brief overview here, but if you’re interested in learning more, Edwin Chen has published an excellent explanation.

Before we begin, let’s define a few things.

  1. Each article is distributed over a set of k topics. So if we let k = 4, a given article may be 60% about the Republican Party, 20% about the 2020 election, 20% about President Trump and 0% about penguins.
  2. Each topic is defined by a distribution over the entire English vocabulary. So if the 2020 election is a topic, 10% of it may consist of the term “2020,” 5% would be the term “Pence,” 5% would be the term “Cuomo,” and so on. (In reality, these numbers are much smaller, typically below 0.5%.)

So, if 50% of an article is about the election topic and 5% of that topic consists of the term “Cuomo,” we would expect about 2.5% of the terms in that article to be “Cuomo.”

We begin by assuming that all the articles were stochastically generated in this way, meaning each term is randomly chosen from a given probability distribution. For each article, we decide how long it will be according to a Poisson distribution and what the topic mixture will be according to a Dirichlet distribution. In Bayesian statistics, we call this the prior because it’s what we believe about the data prior to doing any analysis.
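
To make that generative story concrete, here’s a toy sketch of it in Python. The vocabulary, number of topics and hyperparameters are invented for illustration.

# A toy sketch of LDA's generative story: sample a topic mixture for a
# document, then generate each word by picking a topic from that mixture
# and a term from the chosen topic's distribution. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["2020", "pence", "cuomo", "trump", "health", "insurance", "penguin"]
k = 3          # number of topics
alpha = 0.5    # Dirichlet prior on per-document topic mixtures
beta = 0.5     # Dirichlet prior on per-topic term distributions

# Each topic is a distribution over the entire vocabulary (rows sum to 1).
topic_term = rng.dirichlet([beta] * len(vocab), size=k)

# Each document gets a topic mixture and a length.
doc_topics = rng.dirichlet([alpha] * k)
doc_length = rng.poisson(lam=12)

words = []
for _ in range(doc_length):
    t = rng.choice(k, p=doc_topics)              # pick a topic for this word
    w = rng.choice(len(vocab), p=topic_term[t])  # pick a term from that topic
    words.append(vocab[w])

print("topic mixture:", np.round(doc_topics, 2))
print("generated document:", " ".join(words))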

Our goal is now to backtrack to try to learn what the topic mixtures, P(T|D), and topic distributions, P(W|T), actually are.

To help illustrate this, here’s some pseudocode:

For each article a:
    For each term w in a:
        For each topic t:
            # p(t | a): the proportion of terms in a that are currently assigned to t
            Compute p(t | a)
            # p(w | t): the proportion of assignments to t, across all articles,
            # that came from the term w
            Compute p(w | t)
        # p(t | a) * p(w | t) is the probability that topic t generated w
        Randomly reassign w to a new topic t with probability p(t | a) * p(w | t)

We perform this process thousands of times until we reach a steady state. At this point, we have a pretty good idea of what the current major topics are.
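
In practice you rarely implement this sampler by hand. As a rough sketch (not the code I used at The Times), here is how the same kind of topic definitions can be learned with gensim’s LdaModel; note that gensim fits the model with online variational Bayes rather than Gibbs sampling, and the corpus below is just a placeholder.

# A minimal sketch of learning topics from a batch of articles with gensim.
# `articles` is a placeholder for one or two weeks of article text.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

articles = ["...article text...", "...article text..."]

# Tokenize naively; a real pipeline would also strip stop words, punctuation, etc.
tokenized = [doc.lower().split() for doc in articles]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, passes=10)

# Print the top 15 terms for a few topics, like the Trump, North Korea and
# health care examples above.
for topic_id, terms in lda.show_topics(num_topics=5, num_words=15, formatted=False):
    print(topic_id, [term for term, _ in terms])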

How topic clustering improves search

I spent the second half of my internship developing a way to use these topic clusters to return more relevant search results. I’ll call the algorithm I built ClusterSearch.

ClusterSearch maps the user query to a topic if any of the query’s terms is one of the top 15 terms in the topic definition. There are more sophisticated ways to do this — you could take into account the weight of each term in the definition — but this method worked well for my project.

Once ClusterSearch has determined which topics are relevant, it expands the user query with the top 15 terms from each matching topic’s definition. Finally, it runs an Elasticsearch query on this expanded query, giving more weight to terms from the user’s original query.
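
Here is a rough sketch of that mapping and expansion step. The data structures and helper function are my own, invented for illustration; this is not the production code.

# A sketch of ClusterSearch's query expansion. `topics` maps a topic id to
# its top 15 terms, as produced by the topic clustering step above.
# (The terms shown are from the Simpson example below, abridged.)
topics = {
    17: ["simpson", "former", "parole", "friend", "years", "prison", "robbery",
         "board", "nine", "vegas", "angeles", "hearing", "thursday", "nevada"],
    # ... the other topics ...
}

def expand_query(query, topics):
    """Return (original_terms, expansion_terms) for a user query."""
    original_terms = query.lower().split()
    expansion_terms = []
    for top_terms in topics.values():
        # Map the query to a topic if any query term appears in its top terms.
        if any(term in top_terms for term in original_terms):
            expansion_terms.extend(t for t in top_terms if t not in original_terms)
    return original_terms, expansion_terms

original, expansion = expand_query("Simpson", topics)
print(" ".join(original + expansion))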

Let’s look at an example. I ran topic clustering with 100 topics on all the articles from the last two weeks of July.

Let’s say the user searched for “Simpson,” which maps to a single topic definition. O.J. Simpson receiving parole was a major story at the end of July, so the topic’s top 15 terms include “simpson,” “former,” “parole,” “prison,” “vegas” and so on. Each term also has a weight within the topic definition; the term “simpson” makes up about 5% of it. For now, ClusterSearch ignores those weights and just uses the top 15 terms.

We expand the user’s query with these terms. So the new query becomes, “simpson former parole friend years prison robbery board nine vegas angeles hearing thursday nevada.”

We boost terms from the user’s original query so that “simpson” is weighted more heavily than the other terms. This way, even if different user queries map to the same topic, we may return slightly different results that emphasize what the user originally entered. Finally, we run good old Elasticsearch on this expanded query.
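
Here is a sketch of what that last step could look like with the Elasticsearch query DSL. The index name, field names and boost value are placeholders, not The Times’s actual search schema.

# A sketch of running the expanded, boosted query against Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

original = "simpson"
expansion = ("former parole friend years prison robbery board nine "
             "vegas angeles hearing thursday nevada")

query = {
    "bool": {
        "should": [
            # The user's original terms get extra weight...
            {"match": {"body": {"query": original, "boost": 2.0}}},
            # ...while the topic terms pull results toward the current news story.
            {"match": {"body": {"query": expansion}}},
        ]
    }
}

results = es.search(index="articles", query=query, size=3)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["headline"])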

Here are the top three results produced by ClusterSearch:

O.J. Simpson Wins Parole, Claiming He Has Led a ‘Conflict-Free Life’
Simpson, Daniel
Simpson, April

And here are the top three results with Elasticsearch:

Simpson, Daniel
Simpson, April
Simpson Manufacturing Company Inc.

Daniel Simpson and April Simpson are both reporters who left The Times over a decade ago; Simpson Manufacturing is an engineering firm that produces construction materials, but nothing has been published about them recently. In short, ClusterSearch gives us more “topical” results which are more likely to satisfy the user’s intention.

Feasibility

So how feasible is topic clustering? I ran 3,000 iterations on 9,853 articles over 100 topics, which took about 90 minutes on my CPU. Fortunately, the clustering doesn’t need to happen at query time: we only need to run it periodically, offline, to learn what the major news stories of the moment are. Running the algorithm once a week or even once a day is quite practical.

What’s next?

Search algorithms are really difficult to assess for two reasons.

  1. You don’t know what the user was looking for
    There’s no “ground truth” to judge results against. A user searching for “New York” may be looking for travel guides or they could just as easily be looking for news about the upcoming gubernatorial elections. We can only guess.
  2. Different users have different preferences
    Even if you know what users are searching for, deciding the best ranking of articles can still be tricky. Should more recent articles go first? Should more relevant articles go first? These aren’t trivial questions.

To really understand how well ClusterSearch performs, the next step would be to do some user testing.

In addition, there are a few other things we need to do before it’s deployed:

  1. Learn the optimal number of topics to cluster over. This can be done with Hierarchical Dirichlet Processes (see the sketch after this list).
  2. Determine how frequently we should run the topic clustering algorithm. This will take some experimenting.
  3. Integrate ClusterSearch into The Times’ search API.
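
For the first item, here’s a minimal sketch of what that might look like with gensim’s HdpModel, which infers the number of topics from the data rather than fixing it in advance. Again, the corpus is a placeholder.

# A sketch of using a Hierarchical Dirichlet Process so the number of topics
# does not have to be chosen ahead of time.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

articles = ["...article text...", "...article text..."]  # placeholder corpus
tokenized = [doc.lower().split() for doc in articles]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

hdp = HdpModel(corpus=corpus, id2word=dictionary)

# Inspect the topics the model actually found.
for topic_id, terms in hdp.show_topics(num_topics=10, num_words=15, formatted=False):
    print(topic_id, [term for term, _ in terms])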

Thank You

A huge thank you to everyone at The Times I had a chance to work with over the summer; every panel, coffee chat and event left me feeling inspired and excited. A special thank you to my managers John Daniels and Boerge Svingen; my team mentor, Jeremiah Via; my intern buddy, Will Dunning; and the Women in Technology Task Force. I was lucky to have such a wonderful set of mentors and friends to guide my internship.

Get in touch!

It was an incredible summer interning at The Times. If you have any questions or feedback, or want to hear about the time Nicholas Kristof caught me pointing him out to friends in the cafeteria, feel free to get in touch with me at aanab@princeton.edu.

