The Code Names Bot generates clues for the board game Code Names by processing Wikipedia articles with Python and NLTK.
Background
Terminology
Term: A word card in Code Names. Examples: “Loch Ness”, “Apple”, “China.”
Clue: A potential clue that can be given for a term. Examples: “Monster”, “Fruit”, “Nation.”
Term page: A Wikipedia page that is related to a term. For example, term pages for the term “Apple” will include the Wikipedia page for Apple (fruit) and Apple Inc.
Clue page: The Wikipedia page for a clue.
Iteration Process
In order to consistently evaluate different clue generation strategies, I scored each strategy using this equation:
score=(# correct guesses — # incorrect guesses) / # clues given
Each strategy was tested against a fixed set of Code Names boards. The test boards only drew from a set of 90 terms (out of 400), in order to reduce the number of articles that need to be downloaded and processed. If a clue was given for 3 terms, and the player (me) guesses 2 correctly and 1 incorrectly, the score will be 1. A good clue generation strategy should result in more correct guesses than incorrect guesses and fewer clues given.
The scores have an arbitrary factor, since the terms that I guess for a given clue are different from what someone else might guess. The score for the strategy used in the final Code Names Bot is 1.2.
Clue Limitations
The Code Names Bot generates clues that contain only 1 word, as per Code Names rules. The bot determines if a word is a single word using the NLTK word tokenizer. The NLTK word tokenizer considers many hyphenated words to be a single word and therefore legal clues, while some Code Names communities consider hyphenated words to be illegal clues.
Preprocessing
The preprocessing steps will save a set of (clue, term, score) tuples into a sqlite database. The score from 0–1 will represent the relatedness of the clue to the term.
Step 1: Downloading Wikipedia Links and Titles
Instructions for downloading a sqlite file containing Wikipedia links and page titles can be found in the SixDegreesOfWikipedia repository. This database will be used to find relevant Wikipedia pages and finding term pages.
Step 2: Compile Term Pages
This is my process for finding the term pages for a given term:
- Find the disambiguation page for a term by appending “_(disambiguation)” behind it. Ex: Apple → Apple_(disambiguation).
- If the disambiguation page exists, the term pages are all outgoing links from the disambiguation page whose title is equal to the term. Outgoing links whose titles contain but are not equal to the term are not included, because it would otherwise result in “Quarterback” being a term page for “back,” even though they aren’t closely associated.
- Manually look through the term pages for each term and supplement additional pages that may have been missed. For example, the term “Conductor” should have the term page “Electrical_conductor.”
Step 3: Compile Synonyms
Synonyms are manually compiled. I experimented with using WordNet to produce synonyms for each term. However, some important synonyms are missing, such as “Deer” being a synonym for “Buck.”
Step 4: Identify Clue Pages of Interest
Wikipedia contains many millions of pages. In order for the downloading and processing jobs to run within a reasonable amount of time, potential clue pages need to be identified and filtered beforehand.
This is the process I used:
- From each term, find all neighbors that are connected via an incoming or outgoing link to its term pages.
- Filter out neighbors whose title contains more than 1 word. This prevents pages whose title isn’t a valid clue, such as “Invasive Species in Australia,” from being downloaded and processed.
- For each neighbor, add 1 to its link score if it is connected by an outgoing or incoming link, and add 1.5 if it is connected by both an incoming and outgoing link to a term page. The result is that a page with 3 links to 3 different terms will have a score of 3.
This is the histogram by score for the subset of 90 terms used for testing:
As seen, there are a large amount of pages that are only connected to one term. These pages will be ignored since we want clue pages that can be used to clue for multiple terms. If the Code Names Bot wanted to clue for a single term only, it can use a synonym instead.
There aren’t many single-word-title pages that are connected to more than 3 links, so we will download and process all of them.
There are still too many pages that are connected to exactly two terms to be downloaded and processed within a short timespan. These pages will be filtered by limiting each pair of terms to 10 clue pages. For example, the terms “America” and “Australia” have many potential clue pages, but only 10 will be processed.
Step 5: Download Clue Pages and Source Pages
Clue pages and source pages will be downloaded from Wikipedia, with the page’s text stored in a sqlite database along with the page id. This is a fairly straightforward process, with some considerations to be aware of:
- Downloading pages using multiple threads or asynchronously will be much faster than using a single thread.
- Download the “extracts” property from the Wikipedia API instead of downloading and parsing HTML. Downloading text only is faster since it won’t contain unnecessary files such as images.
- Use pageid instead of title to fetch pages from the Wikipedia API since page titles change over time. For example, the “Organ_(anatomy)” page in the Wikipedia titles database has since been renamed to “Organ_(biology).”
Step 6: Count Terms in Clue Pages
For each clue page, the number of occurrences of each term will be counted. Some considerations:
- All terms will be counted for each clue page, instead of only the terms that were linked to the page. This is because pages like “Volleyball” also contain many instances of the term “Block,” even though “Volleyball” doesn’t have a link to “Block.” By counting all terms, cases like these can be counted.
- For each term, count all occurrences of itself and its inflections. For example “Striking” and “Struck” should also be counted for the term “Strike.” This allows for a more comprehensive count. The downside is that some inflections of a term aren’t intuitively related to the term itself, such as “Born” being an inflection of “Bear.”
- Parts of speech will be counted separately, and the maximum will be chosen as the final term count. This is because terms can have very different meanings between parts of speech. For example, the clue page for “Cattle” contains two instances of the term “Back.” One sentence has “back to the mouth,” where “back” is used as an adverb. Another sentence has “back of the cattle,” where “back” is a noun. This should only result in a term count of 1.
- Named entities will be counted separately, with the maximum count returned. Since human players determine the relatedness of a clue to a term based on its strongest link instead of the sum of its links, the term count for each noun entity will be treated separately. For example, “Entrepreneurship” contains both “Bill Gates” and “Bill Hewlett,” so the occurrences of “Bill Gates” and “Bill Hewlett” will be counted separately.
Optimization notes:
- NLTK’s tagger constructs a new PerceptronTagger every time it is called, which is expensive. Constructing the PerceptronTagger and using it directly results in shorter runtimes.
- Noun chunking and determining named entities is an expensive process. First use the tagger to check if proper nouns exist. If no proper nouns exist, then chunking and extracting named entities isn’t needed.
Step 7: Extract Clues from Term Pages
In addition to using clue pages to identify clues, term pages can also be processed to find potential clues. For example, the term “Mammoth” is never used in the clue page “Animal,” while “Animal” is mentioned several times in the term page for “Mammoth.” By processing term pages, “Animal” will be discovered as a clue for “Mammoth.”
Clues can be found from term pages by extracting noun chunks. If the noun chunk contains a number, then it should be ignored. This prevents frequent numeric nouns such as “Day” or “Year” from being clued for terms that aren’t related. For each noun chunk, count the number of occurrences of its root noun in the page.
Step 8: PageRank Clue Pages
The PageRank score of each clue page is calculated to determine how obscure it is. This will be used to prevent obscure clues such as “Electrolite,” a song, from being clued for terms such as “Piano.”
Optimization note: The code in the Git repo uses an iterative method to calculate PageRank, since there isn’t enough memory on a personal computer to use the matrix method on all of the Wikipedia pages.
Step 8: Scoring Clues
For each term and clue combination, a score from 0–1 is calculated representing the confidence that a guesser can identify the link. For each term count in a clue page, the score for the clue and term is calculated as
min(1, PageRank of clue page / 6) * 1–0.7^term count
An exponential is used to allow pages with a higher term count to have a higher score, while keeping the score below 1. Most pages that are commonly known have a page rank of at least 6. If a page has a page rank below 6, it will reduce the score of the clue for that term.
Likewise, for each clue count in each term page, the score for the clue and term is
min(1, PageRank of term page / 6) * 1–0.7^clue count
The maximum score of each clue and term combination will be used for the clue generation process.
Clue Generation
The clue generator takes in a set of positive terms and a set of negative terms and returns the best clue to give. For example, if the Code Names Bot is giving clues to blue team, the positive terms will be the blue terms, while the negative terms will be the red terms, blank terms, and death term.
The best clue is determined in these steps:
- For each positive term, get the set of possible clues for the term (the final output of the preprocessing step). Union all of these clue sets to get the set of possible clues for the positive terms.
- For each possible clue, find the maximum score between the clue and the negative terms. This will be the threshold score of the clue.
- For each possible clue and each positive term, if the score is greater than the threshold score, add it to the clue’s final score and add the term to the clue’s term list.
- Find the clue with the highest score, and give that clue. The number associated with the clue is the number of terms in the clue’s term list.
This returns the clue that has the highest expected number of correct guesses while avoiding having negative terms guessed.