Solve WORDLE with R

by Tim Schatto-Eckrodt

posted on 2022-01-12

As seemingly everyone is playing WORDLE, I thought it would be fun to code something in R to help with the guesswork! While this technically can be considered cheating, it was still a fun exercise to code in R.¹

If you want to give it a go, you can find the code here. This requires the words package by Conor Neilson, which contains a list of 175,393 official English Scrabble words, 9,330 of which fit the 5-letter constraint of WORDLE. You'll also need the tidyverse packages dplyr, stringr, and magrittr.

If you just want to solve those puzzles in a more reliable way, download the script and use it like this. If you are interested in how the script works, scroll down to How It Works.

How To Use

Let's look at a step-by-step example! We'll start with aeons and get the following hint:

So we'll set the arguments accordingly. We know that there is an s at position 5 (phrase = "....s"), we know that the letters a, e, and n are not in the word (incorrect_chars = "aen"), and we know that there is an o, but not at position 3 (correct_chars = list("o"="3")).

f.solve_wordle(
    phrase = "....s",
    incorrect_chars = "aen",
    correct_chars = list(
        "o"="3"
    )
)

This returns a long list of words (263 to be exact). Let's pick one at random, bogus, and try it next. We receive the following hint:

Ok, now we know that there is an o at the second position and a u (but not at position 4). Running the following command returns 24 words.

f.solve_wordle(
    phrase = ".o..s",
    incorrect_chars = "aenbg",
    correct_chars = list(
        "o"="3",
        "u"="4"
    )
)

Let's try coups next. We get the following hint:

Adding the new characters to our function:

f.solve_wordle(
    phrase = ".o..s",
    incorrect_chars = "aenbgc",
    correct_chars = list(
        "o"="3",
        "u"="4",
        "p"="4"
    )
)

This returns only three words (poufs, pours, and pouts)! Let's try pours next.

That was it! Nice!

How It Works

The code is pretty straightforward. It uses regular expressions (or regex) to filter the list of words from the words package using the hints given by WORDLE.

The f.solve_wordle() function takes four arguments. The first one, len, just lets you set the length of the words you want to guess. In accordance with the original WORDLE rules, this is set to 5 by default.

The second argument, phrase, is a string with the letters that are in the correct position, with . as wildcards. For example, if you know that there is a p at the second position, use phrase = ".p...". In the R code, this looks like this:

filter(out, str_detect(word, pattern = regex(phrase)))

In regex, the . matches any single character, so a pattern like a.. matches an a followed by any two characters. As long as phrase has as many characters as the words being filtered, each letter in phrase is effectively pinned to its position.
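To make the wildcard filter concrete, here is a minimal sketch on a small, made-up word vector (the vector candidates is an assumption for illustration, not the list from the words package). It uses base R's grepl() in place of str_detect() so it runs without extra packages; the pattern itself is the same.

```r
# Hypothetical mini word list standing in for the 5-letter words
candidates <- c("pours", "bogus", "coups", "aeons", "spore")

# Known letters are written in place, "." stands for any single letter
phrase <- ".o..s"

# Keep only the words that match the pattern
matches <- candidates[grepl(phrase, candidates)]
print(matches)
```

Only words with an o in second place and an s in fifth place survive the filter.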

The next argument, incorrect_chars, is used to track incorrect guesses. For example, if you know there is no a, b, or c, use incorrect_chars = "abc" (the order of the letters is not important). While this could easily be done by filtering out all words that match a regex like [abc], you would run into problems with letters that occur multiple times in the WORDLE. For example, we might know that there is a single e, but no additional e in the WORDLE. If you guessed the first four letters correctly, e.g. thre, and you know that there is no additional e, the answer must be threw. If an additional e were allowed, the answer could be either three or threw.

So, to solve this problem we first split the whole string of incorrect letters into individual letters (i.e. "abc" to "a", "b", "c"). Then we check for each letter whether it was already guessed correctly (i.e. whether it is part of the phrase or the correct_chars argument). If it isn't, we just filter out all words that contain that letter: filter(out, !str_detect(word, pattern = regex(incorrect_char))). If it is part of the correctly guessed letters, we filter out words where the letter occurs more than once. To do this we use a regex pattern like this: (x).*\1, where x is the letter we want to check. The \1 matches the same text as most recently matched by the 1st capturing group. Put everything together and it looks like this:

### split string to list of characters
incorrect_chars <- str_split(incorrect_chars, "")[[1]]
### iterate over list of characters
for (incorrect_char in incorrect_chars) {
    # if letter is not part of the correct letters
    if (
        !incorrect_char %in% correct_chars &&
        !str_detect(phrase, incorrect_char)
    ) {
        # simply filter all words that contain the letter
        out <- filter(out, !str_detect(word, pattern = regex(incorrect_char)))
    } else {
        # filter all words that contain the letter more than once
        out <- filter(out, !str_detect(word, pattern = regex(
            paste0("(", incorrect_char, ").*\\1")
        )))
    }
}
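The backreference pattern can be tried in isolation. The following is a small sketch on a made-up word vector (words and the variable names are assumptions for illustration), again using base R's grepl() with perl = TRUE rather than str_detect():

```r
# Hypothetical candidates; "three" and "there" contain two e's
words <- c("three", "threw", "there", "tower")

# Build the pattern the same way as in the loop above: "(e).*\\1"
letter <- "e"
pattern <- paste0("(", letter, ").*\\1")

# TRUE when the captured letter shows up again later in the word
has_repeat <- grepl(pattern, words, perl = TRUE)

# Drop the words in which the letter occurs more than once
single_e <- words[!has_repeat]
print(single_e)
```

Words with exactly one e are kept, which is the behaviour needed for the thre example above.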

The correct_chars argument is used to track correct letters that are not in the right place. For example, if you know there is a g, but it's not at position 5 or 1, use correct_chars = list("g"="5,1"). The correct_chars must be a list with correct letters as keys and a string of numbers separated by a , for the incorrect positions.

As you can see, this argument keeps track of two things: 1) correct letters and 2) incorrect positions. To keep things simple we'll deal with these two things separately. As the correct_chars argument is a list (with the correct letters as keys and the corresponding incorrect positions as values), we first copy the correct_chars variable into a new variable (incorrect_positions) and then save the names (or keys) of correct_chars back to correct_chars.

incorrect_positions <- correct_chars
correct_chars <- names(correct_chars)
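As a quick illustration of what the two variables hold after this step, here is a sketch with a made-up correct_chars list. The helper variable positions_g is hypothetical, and base R's strsplit() stands in for str_split().

```r
# Example input: g is not at position 5 or 1, u is not at position 4
correct_chars <- list("g" = "5,1", "u" = "4")

# Keep the full list for the position checks ...
incorrect_positions <- correct_chars
# ... and reduce correct_chars to the letters themselves
correct_chars <- names(correct_chars)
print(correct_chars)

# The forbidden positions for one letter, split on the comma
positions_g <- as.numeric(strsplit(incorrect_positions[["g"]], ",")[[1]])
print(positions_g)
```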

Now we can loop over each correct letter and check if it is part of the phrase. If it is not, we just keep the words that contain that letter. If it is part of the phrase, we keep only the words that contain multiple occurrences of the letter, by using a positive lookahead in our regex, like this: x(?=.*x).

for (correct_char in correct_chars) {
    if (str_detect(phrase, correct_char)) {
        out <- filter(out, str_detect(word, pattern = regex(
            paste0(correct_char, "(?=.*", correct_char,")")
        )))
    } else {
        out <- filter(out, str_detect(word, pattern = regex(correct_char)))
    }
}
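To see what the lookahead does on its own, here is a minimal sketch with a made-up word vector (words and twice are illustrative names). Base R's grepl() needs perl = TRUE for lookaheads; str_detect() supports them out of the box.

```r
# Hypothetical candidates; only "pours" has a single o
words <- c("roots", "pours", "boost", "odors")

# Same construction as in the loop above: "o(?=.*o)"
letter <- "o"
pattern <- paste0(letter, "(?=.*", letter, ")")

# TRUE when an o is followed, somewhere later in the word, by another o
twice <- grepl(pattern, words, perl = TRUE)
print(words[twice])
```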

Lastly, we want to go over our list of incorrect_positions. Here we use a nested for-loop to filter out every word that has a letter in an incorrect position.

for (key in names(incorrect_positions)) {
    # [[key]] extracts the position string itself rather than a one-element list
    positions <- str_split(incorrect_positions[[key]], ",")[[1]]
    for (pos in positions) {
        filter_pat <- paste0("^.{", as.numeric(pos)-1, "}", key)
        out <- filter(out, !str_detect(word, pattern = regex(filter_pat)))
    }
}

This piece of code checks every incorrect position (splitting the string by ,) for every key/name in the incorrect_positions list. The regex for this part looks like this: ^.{0}x. The ^ anchors the match to the beginning of the word, .{n} matches exactly n characters (none at all for n = 0), and x is the letter that should not be at that position. So, ^.{3}b would look for words that start with three arbitrary letters, followed by a b, i.e. words where the b is in the fourth position².
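The position pattern can be checked on a handful of words. This is a small sketch with a made-up vector (words, letter, pos, and kept are illustrative names), using base R's grepl() instead of str_detect():

```r
# Hypothetical candidates; "coups" and "soups" have a p in fourth place
words <- c("pours", "coups", "soups", "pouts")

# We know there is a p, but not at position 4
letter <- "p"
pos <- 4

# Skip pos - 1 letters from the start, then look for the letter: "^.{3}p"
pattern <- paste0("^.{", pos - 1, "}", letter)

# Remove every word that has the letter at the forbidden position
kept <- words[!grepl(pattern, words)]
print(kept)
```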

¹ This of course has nothing to do with the fact that I suck at these kinds of letter-based games.
² Note that the 1st position corresponds to the 0th character in the string. See off-by-one errors and zero-based numbering.