The Land of Oz Ozzie Liu

Tracing American Culture and Language Through NYT Crossword Puzzles

Project 4 at Metis focuses on unsupervised machine learning and Natural Language Processing. We also incorporate NoSQL for data storage, Flask as a Python-based web framework, and D3 for visualization.

But what topic should I work on? I really enjoy doing crossword puzzles and I’ve been working on them in the mornings before Metis to get my brain started. Maybe I can combine crossword puzzles with data science?!

[Figure: Crossword Puzzles!]


Background

So of course, I Googled to see if anything similar had been done, and I came across a timely op-ed piece in the New York Times. Professor Charles Kurzman from UNC wrote a very interesting article looking at what NYT crossword puzzles say about the words we use. That’s it! I can extend his project by using data science and unsupervised machine learning. While Kurzman’s research focused on foreign language usage, I want to perform topic modeling on clues and answers, without manual input.

I was able to obtain about 40 years of NYT crossword puzzles with further research on the web. So I parsed the raw data from JSON and stored it locally with MongoDB. That’s 14,000+ puzzles with 1.2 million clue-answer pairs.
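A minimal sketch of that parsing step might look like the following. The JSON layout (a `date`, an `editor`, and parallel `clues` and `answers` lists per direction) is an assumption made for illustration, since the raw format is not described here, and the database and collection names in the final comment are hypothetical too.

```python
import json
import re
from datetime import datetime


def flatten_puzzle(puzzle):
    """Yield one flat document per clue-answer pair in a puzzle dict.

    The field names used here are assumed, not the real raw-data schema.
    """
    date = datetime.strptime(puzzle["date"], "%m/%d/%Y")
    for direction in ("across", "down"):
        clues = puzzle["clues"][direction]
        answers = puzzle["answers"][direction]
        for clue, answer in zip(clues, answers):
            yield {
                "answer": answer.upper(),
                # Strip a leading clue number such as "17. " if present.
                "clue": re.sub(r"^\d+\.\s*", "", clue),
                "date": date,
                # Monday = 1 ... Sunday = 7, as in the tables in this post.
                "day_of_week": date.isoweekday(),
                "editor": puzzle.get("editor"),
            }


# A tiny stand-in for one raw puzzle file (not a real puzzle).
raw = """
{"date": "4/20/2009", "editor": "Will Shortz",
 "clues": {"across": ["1. Googly-eyed Muppet"], "down": []},
 "answers": {"across": ["ELMO"], "down": []}}
"""
docs = list(flatten_puzzle(json.loads(raw)))
print(docs[0]["answer"], "-", docs[0]["clue"])

# With PyMongo, the documents could then be stored with something like:
#   MongoClient()["crosswords"]["clues"].insert_many(docs)
```

Storing one document per clue-answer pair, rather than one per puzzle, keeps later lookups by answer simple.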

Goal

We know that crossword puzzles are creative and clever, but most importantly they are current. Constructors and editors are well aware of what their solvers know. In other words, crossword puzzles reflect culture and current events. And by looking at them over time, we can see how American society has shifted and how the English language has evolved.

My goal is straightforward: given a word or phrase used as an answer in past crossword puzzles, I want to visualize how its usage and meaning have changed over time. And in doing so, get a unique view into how American culture and language have evolved.

Stack

The technology stack’s not too complicated. I store data in MongoDB and run my machine learning in Python behind Flask. Flask then passes the results to a JavaScript front end, which renders the D3 visualizations. Here’s a rough diagram:

User <----> D3, Javascript, HTML
            Flask
            Python
            PyMongo
Raw Data -> MongoDB

Exploratory Data Analysis

With access to our raw data in place, I can query the database for a specific answer and find all the clues used to describe that answer over time.

For example, here’s a snippet of the answer ELMO:

Answer  Clue                                             Date        Day of Week  Editor
ELMO    Patron saint of sailors                          10/30/2008  4            Will Shortz
ELMO    Sailor’s saint                                   1/15/2009   4            Will Shortz
ELMO    Googly-eyed Muppet                               4/20/2009   1            Will Shortz
ELMO    Red-haired PBS star                              10/4/2009   7            Will Shortz
ELMO    Muppet with a goldfish named Dorothy             12/13/2009  7            Will Shortz
ELMO    Must-have toy of 1996                            12/26/2009  6            Will Shortz
ELMO    Citizen of Sesame Street                         3/7/2010    7            Will Shortz
ELMO    Adm. Zumwalt, chief of naval operations during…  7/7/2010    3            Will Shortz
ELMO    Tickle Me ___                                    3/28/2011   1            Will Shortz
ELMO    Patron saint of sailors                          4/5/2011    2            Will Shortz
ELMO    “Tickle Me” doll                                 7/11/2011   1            Will Shortz

The date is self-explanatory, and Day of Week indicates the difficulty of the puzzle. In the New York Times, Monday (1) is the easiest and the puzzles get progressively harder throughout the week, with Saturday (6) being the toughest. Sunday (7) is usually a bigger puzzle at about a Wednesday-Thursday difficulty.

There’s a good amount of variation in the clues here, but they mostly refer to the furry red Muppet on Sesame Street. In the 80s (not shown), ELMO is mostly clued as St. Elmo, the patron saint of sailors, or as the pollster Elmo Roper. Indeed, Roper has appeared only once since the 90s. We immediately see how the use of ELMO has changed.

[Figure: Notable Elmos throughout time: the Muppet, and the patron saint of sailors and abdominal pain]

Here’s another example that I enjoy with the word EURO:

Answer  Clue                                      Date        Day of Week  Editor
EURO    Large kangaroo                            5/12/1987   2            Eugene T. Maleska
EURO    Large kangaroo                            2/19/1988   5            Eugene T. Maleska
EURO    Kind of bonds or dollars                  10/19/1989  4            Eugene T. Maleska
EURO    Wallaroo                                  3/8/1990    4            Eugene T. Maleska
EURO    Large kangaroo                            7/17/1990   2            Eugene T. Maleska
EURO    Replacement for the mark, franc and lira  5/14/2012   1            Will Shortz
EURO    New circulator of 2002                    6/16/2012   6            Will Shortz
EURO    Prefix with zone                          6/17/2012   7            Will Shortz
EURO    Replacement for the mark and franc        7/16/2012   1            Will Shortz
EURO    Continental coin                          10/29/2012  1            Will Shortz

Euro used to be clued as a large kangaroo? Boy, are times different. And we’ll see the rise and fall of meanings later with some cool visualizations.

[Figure: Euro, a nickname for the wallaroo, a moderately large macropod intermediate in size between kangaroos and wallabies]

Machine Learning

So far I’ve injected my own insight to pick out the clusters of clues used. But let’s have machine learning take care of that.

The first thing I do is query my database for a given word to collect all the clues used. Then I take the clues and build a bag-of-words model to get a vector for each tokenized clue. Scikit-Learn’s CountVectorizer is perfect for this. The result is a sparse matrix that looks something like this for the word CLAY:

Clue                   Ali  once  great  compromiser  pigeon  grass
Ali, once              1    1     0      0            0       0
The Great Compromiser  0    0     1      1            0       0
The Great Pacificator  0    0     1      0            0       0
Kind of pigeon         0    0     0      0            1       0
Grass alternative      0    0     0      0            0       1
Soil                   0    0     0      0            0       0
Notable Whig           0    0     0      0            0       0
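To make the counting concrete, here is a small plain-Python sketch that rebuilds the rows above. It is a stand-in for what CountVectorizer does, not the Scikit-Learn code itself: the vocabulary is fixed by hand to the six columns shown (an assumption for illustration), whereas CountVectorizer would learn its vocabulary from all the clues.

```python
import re

clues = [
    "Ali, once",
    "The Great Compromiser",
    "The Great Pacificator",
    "Kind of pigeon",
    "Grass alternative",
    "Soil",
    "Notable Whig",
]

# Hand-picked columns matching the table; a real run would use every token.
vocabulary = ["ali", "once", "great", "compromiser", "pigeon", "grass"]


def bag_of_words(clue, vocabulary):
    """Count how often each vocabulary term appears in a lowercased clue."""
    tokens = re.findall(r"[a-z]+", clue.lower())
    return [tokens.count(term) for term in vocabulary]


matrix = [bag_of_words(clue, vocabulary) for clue in clues]
for clue, row in zip(clues, matrix):
    print(f"{clue:<22}", row)
```

Clues whose words fall outside the vocabulary, like Soil here, end up as all-zero rows.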

We can see that “The Great Compromiser” and “The Great Pacificator” both refer to the early American politician Henry Clay, and their vectors are close because they share the word “Great”. K-means clustering compares the Euclidean distances between these vectors, so the algorithm can do a good job of picking out similar topics.
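The distance comparison can be sketched with a bare-bones k-means in plain Python, run on the seven CLAY vectors from the table above. This is an illustration, not Scikit-Learn's implementation: the starting centroids are fixed by hand (an assumption, real implementations usually pick them randomly) so that the result is reproducible.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [
    [1, 1, 0, 0, 0, 0],  # Ali, once
    [0, 0, 1, 1, 0, 0],  # The Great Compromiser
    [0, 0, 1, 0, 0, 0],  # The Great Pacificator
    [0, 0, 0, 0, 1, 0],  # Kind of pigeon
    [0, 0, 0, 0, 0, 1],  # Grass alternative
    [0, 0, 0, 0, 0, 0],  # Soil
    [0, 0, 0, 0, 0, 0],  # Notable Whig
]

# Hand-picked starting centroids, for reproducibility only.
labels, centroids = kmeans(points, [points[0], points[2], points[5]])
print(labels)  # -> [0, 1, 1, 2, 2, 2, 2]
```

The two “Great” clues land in the same cluster, while the clues that share no words with anything else are lumped together.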

Here’s a snippet of CLAY when trying 5 clusters:

Answer  Clue                        Date        DoW  Cluster
CLAY    Ali, once                   4/26/1994   2    1
CLAY    *Trapshooting … Ali … kiln  5/23/2002   4    1
CLAY    Ali, before he was Ali      11/22/2010  1    1
CLAY    The Great Pacificator       9/20/1987   7    2
CLAY    The Great Compromiser       11/20/1988  7    2
CLAY    The Great Pacificator       2/18/1991   1    2
CLAY    Potter’s supply             12/25/2001  2    3
CLAY    Modeler’s need              1/22/2006   7    3
CLAY    Famous Whig                 7/4/1978    2    3

Groups 1 and 2 make sense, but in group 3, how did “Famous Whig” get related to potter’s clay? This brings up some concerns that are specific to crossword puzzles.

Considerations and Challenges

Crosswords are puzzles, and clues are cleverly designed to be tricky, and therefore fun when you can figure them out. But this is a challenge for my bag-of-words model. Certain clues are phrased in such a way that there’s no immediate connection to the answer. For example, here are a few of my favorite ones from this year:

  • Drooling from both sides of the mouth: DOUBLEDRIBBLE
  • Fly swatter?: BUZZERBEATER

It’s only when you realize that there’s a basketball theme in the puzzle that these answers start to make sense. How can I incorporate these nuances in my model?

In addition, clues are sparse, a few words at most. This reduces the number of features available to me to measure similarity.

But there are certain rules that I can use to my advantage:

  • Puzzles are symmetrical (rotational and at times horizontal)
  • Answers are at least 3 letters long
  • Tenses and plurality match
  • Abbreviations are indicated with periods or “Abbr.”
  • Foreign words are also hinted as such
  • Punctuation has meaning
    • ? Question marks mean wordplay
    • ! Exclamation marks indicate it’s a phrase or exclamation
    • ” ” Quotes are also for sayings, or the title of a book, movie, or other work
    • ___ Blanks indicate a fill-in-the-blank clue

In addition, answers come from common sources: geographical names like mountain ranges, rivers, and cities can be found in an atlas. References to movies, actors/actresses, books, or artists are usually popular enough.

I can’t address all of these challenges, but by creating my own tokenizer, I can stem words so that I capture all the different tenses and plurals. And by keeping capitalized words, I have a rough Named Entity Recognition system to relate similar proper nouns. I also keep track of any punctuation that shows up so that I can try to cluster similar clues, such as those that involve wordplay or abbreviations.
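As a rough illustration of those three ideas, a tokenizer along these lines could look like the sketch below. Everything in it is hypothetical: the `crude_stem` helper is a toy suffix stripper standing in for a real stemmer, and the `<WORDPLAY>`-style marker tokens are names invented for this example.

```python
import re


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize_clue(clue):
    """Turn a clue into punctuation markers plus word tokens."""
    text = clue.strip()
    tokens = []
    # Punctuation markers, following the clue conventions listed above.
    if text.endswith("?"):
        tokens.append("<WORDPLAY>")
    if text.endswith("!"):
        tokens.append("<EXCLAIM>")
    if "___" in text:
        tokens.append("<BLANK>")
    if any(quote in text for quote in ('"', "“", "”")):
        tokens.append("<QUOTE>")
    if re.search(r"\b[A-Za-z]+\.", text):
        tokens.append("<ABBR>")
    for word in re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)?", text):
        if word[0].isupper():
            # Keep capitalized words intact as a rough proper-noun signal.
            # Caveat: the first word of every clue is capitalized too.
            tokens.append(word)
        else:
            tokens.append(crude_stem(word))
    return tokens


print(tokenize_clue("Fly swatter?"))
print(tokenize_clue("Tickle Me ___"))
print(tokenize_clue("Adm. Zumwalt, chief of naval operations"))
```

The resulting token lists can then be counted in the same bag-of-words fashion as before, so that a question mark or a blank becomes just another feature.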

Visualization

OK, now that I have beefed up my features, I can run my model and easily go from my Python dictionary output to JSON so that JavaScript and D3 can read it. I leverage NVD3 here for some great-looking out-of-the-box charts.

The charts show how the answer was used in the past 40 years by graphing the number of occurrences of the clue cluster per year. Here are some notable illustrations:
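A minimal sketch of that aggregation step, assuming each clustered clue is a dict with a `year` and a `cluster` (hypothetical field names): it counts occurrences per cluster per year and shapes them as a list of `key`/`values` series, the layout NVD3 charts commonly take. The function name and series labels are made up for this example.

```python
import json
from collections import Counter


def cluster_counts_by_year(rows, start_year, end_year):
    """Count clues per (cluster, year) and return one series per cluster."""
    counts = Counter((row["cluster"], row["year"]) for row in rows)
    clusters = sorted({row["cluster"] for row in rows})
    return [
        {
            "key": f"Cluster {cluster}",
            # Emit every year, including zeros, so the series line up.
            "values": [
                {"x": year, "y": counts[(cluster, year)]}
                for year in range(start_year, end_year + 1)
            ],
        }
        for cluster in clusters
    ]


# A few rows taken from the CLAY snippet shown earlier.
rows = [
    {"year": 1987, "cluster": 2},
    {"year": 1988, "cluster": 2},
    {"year": 1991, "cluster": 2},
    {"year": 1994, "cluster": 1},
]
series = cluster_counts_by_year(rows, 1987, 1994)
print(json.dumps(series[1]["values"][:3]))
```

Filling in the zero years explicitly keeps every series the same length, which is convenient when the chart draws the clusters against a shared axis.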

UBER

As Professor Kurzman showed in his NYT op-ed, UBER has gone through some changes. Over the past 40 years, my algorithm found 4 distinct clusters of clues used to describe UBER. In light blue, we see it consistently clued as a German word, as in “Over, in Berlin”. In dark orange, it is a fill-in-the-blank from the German national anthem: “Deutschland ___ alles”. Then, in the late 90s, it appears as a “Modern prefix meaning ‘super’”. In the past few years, coinciding with the decline of UBER as a German word, it’s the “Modern alternative to a taxi” as we know it today.

EURO

As I mentioned earlier, EURO has an interesting history too. While it’s consistently used as a prefix for European things like eurodollars or eurobonds, before the mid 90s, EURO was sometimes given the tough clue “Large kangaroo”. That use disappeared as the euro currency started gaining traction in the nineties. Its use as a new currency peaked in 1999 with the official adoption of the currency. But then it lost its newness and gradually became a replacement currency, perhaps in reminiscence of the franc or mark.

NET

NET continues to be clued as a term in sports such as fishing, volleyball, or tennis, and to a small degree as a bottom-line modifier, as in net take-home pay. But we can also trace the rise and importance of the internet in the 90s just from crossword puzzles.

What’s a good k?

K is the number of clusters that I want my k-means algorithm to find. So far I’ve found that k between 3 and 5 yields the best results, and I haven’t found any answers that have more than 6 different meanings over time. But as a next step, as I mention below, I want to make k adjustable so that anyone can vary it and see what clusters come up.

Extensions and Next Steps

I only had 2 weeks to work on this project, so there are a lot of different things I want to eventually try.

  1. First of all, I want to go beyond k-means to perform my clustering. Even though I don’t have a lot of features, I think a Word2Vec model with a neural network could yield better results. LDA or the new HDA are other options.
  2. Next, make a fully functional web app where people can enter their own word and see how its usage in crossword puzzles has changed over time. This should allow for a chosen number of clusters k and generate a graph in a reasonable amount of time.
  3. Finally, attempt to build off of this exploratory project and generate a themed crossword puzzle? I try to do this in my final project at Metis.

Last Word

This has been a very fun project for me, so look out for it when I host it on the web for you to try out. Also, I will be putting my code on GitHub shortly. Let me know if you have questions or suggestions!