Project 4 at Metis is focused on unsupervised machine learning and Natural Language Processing. We also incorporate MongoDB for data storage, Flask as a Python-based web framework, and visualization with D3.
I love crossword puzzles and have been doing them for many years. Can I combine crossword puzzles with data science?!
- Data Science Stack
- Exploratory Data Analysis
- Machine Learning
- Considerations and Challenges
- Next Steps
- Last Word
So of course, I researched and Googled to see if anything similar had been done, and I came across a timely op-ed piece in the New York Times. Professor Charles Kurzman of UNC wrote a very interesting article looking at what NYT crossword puzzles say about the words we use. That's it! I can extend his work using data science and unsupervised machine learning. While Kurzman's research focused on foreign-language usage, I want to perform topic modeling on clues and answers, without manual input.
With further research on the web, I was able to obtain about 40 years of NYT crossword puzzles. I parsed the raw JSON data and stored it locally in MongoDB: that's 14,000+ puzzles with 1.2 million clue-answer pairs.
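As a rough sketch of the parsing step (the JSON field names below are hypothetical stand-ins, since the real raw format differs), each puzzle can be flattened into one document per clue-answer pair before inserting into MongoDB:

```python
import json

def parse_puzzle(raw_json):
    """Flatten one puzzle's JSON into clue-answer documents.

    The field names ("clues", "answers", "date", "dow", "editor")
    are illustrative stand-ins for the actual schema.
    """
    puzzle = json.loads(raw_json)
    docs = []
    for clue, answer in zip(puzzle["clues"], puzzle["answers"]):
        docs.append({
            "answer": answer,
            "clue": clue,
            "date": puzzle["date"],
            "day_of_week": puzzle["dow"],
            "editor": puzzle["editor"],
        })
    return docs

raw = json.dumps({
    "date": "2009-04-20", "dow": 1, "editor": "Will Shortz",
    "clues": ["Googly-eyed Muppet"], "answers": ["ELMO"],
})
docs = parse_puzzle(raw)

# With a MongoDB instance running, the documents would be stored via pymongo:
# pymongo.MongoClient().crosswords.clues.insert_many(docs)
```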
We know that crossword puzzles are creative and clever, but most importantly they are current. Constructors and editors are well aware of what their solvers know. In other words, crossword puzzles reflect culture and current events. And by looking at them over time, we can see how American society has shifted and how the English language has evolved.
My goal is straightforward: given a word or phrase used as an answer in past crossword puzzles, I want to visualize how its usage and meaning have changed over time, and in doing so, get a unique view into how American culture and language have evolved.
Exploratory Data Analysis
With access to our raw data in place, I can query the database for a specific answer and find all the clues used to describe it over time.
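The query itself is a simple equality match on the answer field; with pymongo it would look like `db.clues.find({"answer": "ELMO"}).sort("date")`. Here's a minimal stand-in over plain dicts (so it runs without a Mongo server):

```python
def find_clues(collection, answer):
    """Return all clue documents for a given answer, oldest first.

    Stand-in for: db.clues.find({"answer": answer}).sort("date")
    """
    matches = [doc for doc in collection if doc["answer"] == answer]
    return sorted(matches, key=lambda doc: doc["date"])

collection = [
    {"answer": "ELMO", "clue": "Patron saint of sailors", "date": "2008-10-30"},
    {"answer": "EURO", "clue": "Large kangaroo", "date": "1987-05-12"},
    {"answer": "ELMO", "clue": "Googly-eyed Muppet", "date": "2009-04-20"},
]
elmo_clues = find_clues(collection, "ELMO")
```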
For example, here’s a snippet of the answer ELMO:
| Answer | Clue | Date | Day of Week | Editor |
|---|---|---|---|---|
| ELMO | Patron saint of sailors | 10/30/2008 | 4 | Will Shortz |
| ELMO | Sailor's saint | 1/15/2009 | 4 | Will Shortz |
| ELMO | Googly-eyed Muppet | 4/20/2009 | 1 | Will Shortz |
| ELMO | Red-haired PBS star | 10/4/2009 | 7 | Will Shortz |
| ELMO | Muppet with a goldfish named Dorothy | 12/13/2009 | 7 | Will Shortz |
| ELMO | Must-have toy of 1996 | 12/26/2009 | 6 | Will Shortz |
| ELMO | Citizen of Sesame Street | 3/7/2010 | 7 | Will Shortz |
| ELMO | Adm. Zumwalt, chief of naval operations during… | 7/7/2010 | 3 | Will Shortz |
| ELMO | Tickle Me ___ | 3/28/2011 | 1 | Will Shortz |
| ELMO | Patron saint of sailors | 4/5/2011 | 2 | Will Shortz |
| ELMO | "Tickle Me" doll | 7/11/2011 | 1 | Will Shortz |
Date is informative, and Day of Week indicates the difficulty of the puzzle. In the New York Times, Monday (1) is easiest, and puzzles get progressively harder through the week, with Saturday being the toughest. Sunday is usually a bigger puzzle at about Wednesday-Thursday difficulty.
There's a good amount of variation in the clues here, but they mostly refer to the furry red Muppet on Sesame Street. In the 80s (not shown), ELMO mostly referred to St. Elmo, the patron saint of sailors, or the pollster Elmo Roper. Indeed, Roper has appeared only once since the 90s. We immediately see the difference in how Elmo is used.
Here’s another example that I enjoy with the word EURO:
| Answer | Clue | Date | Day of Week | Editor |
|---|---|---|---|---|
| EURO | Large kangaroo | 5/12/1987 | 2 | Eugene T. Maleska |
| EURO | Large kangaroo | 2/19/1988 | 5 | Eugene T. Maleska |
| EURO | Kind of bonds or dollars | 10/19/1989 | 4 | Eugene T. Maleska |
| EURO | Wallaroo | 3/8/1990 | 4 | Eugene T. Maleska |
| EURO | Large kangaroo | 7/17/1990 | 2 | Eugene T. Maleska |
| EURO | Replacement for the mark, franc and lira | 5/14/2012 | 1 | Will Shortz |
| EURO | New circulator of 2002 | 6/16/2012 | 6 | Will Shortz |
| EURO | Prefix with zone | 6/17/2012 | 7 | Will Shortz |
| EURO | Replacement for the mark and franc | 7/16/2012 | 1 | Will Shortz |
| EURO | Continental coin | 10/29/2012 | 1 | Will Shortz |
EURO used to be clued as a large kangaroo? Boy, are times different. We'll see the rise and fall of meanings later with some cool visualizations.
Euro: Nickname for a wallaroo, a moderately large macropod, intermediate in size between kangaroos and wallabies
So far I’ve injected my own insight to pick out the clusters of clues used. But let’s have machine learning take care of that.
The first thing I do is query my database for a given word to collect all the clues used. Then I take the clues and apply a bag-of-words model to get a vector for each tokenized clue. Scikit-Learn's CountVectorizer is perfect for this. The result is a sparse matrix that looks something like this for the word CLAY:
| Clue | word 1 | word 2 | word 3 | word 4 | word 5 | word 6 | … |
|---|---|---|---|---|---|---|---|
| The Great Compromiser | 0 | 0 | 1 | 1 | 0 | 0 | … |
| The Great Pacificator | 0 | 0 | 1 | 0 | 0 | 0 | … |
| Kind of pigeon | 0 | 0 | 0 | 0 | 1 | 0 | … |
We can see that "The Great Compromiser" and "The Great Pacificator" both refer to the early American politician Henry Clay, and because they share the word "Great", their vectors are close in Euclidean distance. By comparing those distances, k-means clustering can do a good job of picking out similar topics.
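The pipeline above can be sketched in a few lines, assuming scikit-learn is installed (toy clues here stand in for the real database query):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

clues = [
    "The Great Compromiser",
    "The Great Pacificator",
    "Kind of pigeon",
    "Trapshooting disk",
]

# Bag-of-words: one row per clue, one column per vocabulary word
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clues)   # sparse matrix of word counts

# k-means groups clues whose word vectors are close in Euclidean distance
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

On this toy input, the two "Great" clues share vocabulary and land in one cluster, while the pigeon and trapshooting clues fall in the other.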
Here’s a snippet of CLAY when trying 5 clusters:
| Answer | Clue | Date | Day of Week | Cluster |
|---|---|---|---|---|
| CLAY | *Trapshooting … Ali … kiln | 5/23/2002 | 4 | 1 |
| CLAY | Ali, before he was Ali | 11/22/2010 | 1 | 1 |
| CLAY | The Great Pacificator | 9/20/1987 | 7 | 2 |
| CLAY | The Great Compromiser | 11/20/1988 | 7 | 2 |
| CLAY | The Great Pacificator | 2/18/1991 | 1 | 2 |
Groups 1 and 2 make sense, but in group 3, how did the famous Whig get grouped with potter's clay? This leads me to some concerns specific to crossword puzzles.
Considerations and Challenges
Crosswords are puzzles, and clues are cleverly designed to be tricky, which makes them fun when you figure them out. But this is a challenge for my bag-of-words model: certain clues are phrased in such a way that there's no immediate connection to the answer. For example, here are a few of my favorites from this year:
- Drooling from both sides of the mouth: DOUBLEDRIBBLE
- Fly swatter?: BUZZERBEATER
It’s only when you realize that there’s a basketball theme in the puzzle that these answers start to make sense. How can I incorporate these nuances in my model?
In addition, clues are sparse, a few words at most. This reduces the number of features available to me to measure similarity.
But there are certain rules that I can use to my advantage:
- Puzzles are symmetrical (rotational and at times horizontal)
- Answers are longer than 3 letters
- Tenses and plurality match
- Abbreviations are indicated with periods or abbrv.
- Foreign words are also hinted as such
- Punctuation has meaning:
  - `?` Question marks mean wordplay
  - `!` Exclamation marks indicate a phrase or exclamation
  - `" "` Quotes are also for sayings, or the title of a book, movie, or other work
  - `___` Blanks indicate a fill-in-the-blank clue
In addition, answers come from common sources: geographical names like mountain ranges, rivers, and cities can be found in an atlas. References to movies, actors/actresses, books, or artists are usually popular enough.
I can't address all of these challenges, but by creating my own tokenizer, I can stem words to capture the different tenses and plurality. By keeping capitalized words intact, I get a rough Named Entity Recognition system that relates similar proper nouns. I also keep track of any punctuation that shows up, so I can flag clues where wordplay or abbreviations are present and cluster them accordingly.
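A rough sketch of such a tokenizer is below. The suffix-stripping stemmer and the flag-token names are my own crude stand-ins; a real stemmer (e.g. NLTK's SnowballStemmer) would do this part properly:

```python
import re

def crude_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(clue):
    """Tokenize a clue: stem lowercase words, keep capitalized words
    intact (rough named-entity handling), and emit flag tokens for
    meaningful punctuation."""
    tokens = []
    for word in re.findall(r"[A-Za-z']+", clue):
        if word[0].isupper():
            tokens.append(word)           # likely proper noun: keep as-is
        else:
            tokens.append(crude_stem(word.lower()))
    if "?" in clue:
        tokens.append("_WORDPLAY_")       # ? signals wordplay
    if "___" in clue:
        tokens.append("_BLANK_")          # fill-in-the-blank clue
    if "." in clue:
        tokens.append("_ABBR_")           # periods often signal an abbreviation
    return tokens
```

For example, `tokenize("Fly swatter?")` keeps the capitalized "Fly" intact and appends the wordplay flag, so tricky clues can cluster with other tricky clues.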
The charts show how the answer was used over the past 40 years by graphing the number of occurrences of each clue cluster per year. Here are some notable illustrations:
As Professor Kurzman showed in his NYT op-ed, UBER has gone through some changes. In the past 40 years, my algorithm found 4 distinct clusters of clues used to describe UBER. In light blue, we see it consistently used as a German word, as in "Over, in Berlin". In dark orange, as a fill-in-the-blank from the German national anthem: "Deutschland ___ alles". Then in the late 90s, as a modern prefix meaning "super". And in the past few years, coinciding with the decline of UBER as a German word, it's the modern alternative to a taxi as we know it today.
As I mentioned earlier, EURO has an interesting history too. While it's consistently used as a prefix for European things like eurodollars or eurobonds, before the mid 90s EURO was sometimes used as a tough clue for a large kangaroo. That use disappeared as the euro currency started gaining traction in the nineties. Its use as a new currency peaked in 1999 with the official adoption of the currency, but it then lost its newness and gradually became a replacement currency, perhaps reminiscent of the franc or the mark.
NET continues to be used as a term in sports such as fishing, volleyball, or tennis, and to a small degree as a bottom line modifier such as net take-home pay. But we can also trace the rise and importance of the internet in the 90s just from crossword puzzles.
What’s a good k?
K is the number of clusters I ask my k-means algorithm to find. So far I've found that k between 3 and 5 yields the best results, and I haven't found any answers with more than 6 distinct meanings over time. But as a next step, as I mention below, I want to make k dynamic so anyone can vary it and see what clusters come up.
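One standard way to automate the choice, assuming scikit-learn: score each candidate k with the silhouette coefficient and keep the best. The sketch below uses toy 2-D points in place of real clue vectors:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data with three obvious clusters, standing in for clue vectors
points = [
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
]

# Higher silhouette score means tighter, better-separated clusters
scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
```

On real clue vectors the peak is less dramatic than on these toy blobs, but the same loop gives a principled default when a user doesn't pick k themselves.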
Extensions and Next Steps
I only had 2 weeks to work on this project, so there are a lot of different things I want to eventually try.
- First of all, I want to go beyond k-means for my clustering. Even though I don't have a lot of features, I think a Word2Vec neural-network model could yield better results, or topic models like LDA or the newer HDP.
- Next, make a fully functional web app where people can enter their own word and see how its usage in crossword puzzles has changed over time. It should allow a chosen number of clusters k and generate a graph in a reasonable amount of time.
- Finally, attempt to build off of this exploratory project and generate a themed crossword puzzle. I plan to try this in my final project at Metis.
This has been a very fun project for me, so look out for it hosted on the web for you to try out. I will also be putting my code on GitHub shortly. Let me know if you have questions or suggestions!