Thoughts from upside down: Fun with AVX and Wordle

A First Attempt - in Kotlin

When the Wordle game first appeared a couple of years ago, I wrote a helper program to do various things, in particular to suggest the best word to try given progress so far in the game. More later on how that works. I wrote the program in the fairly obscure language Kotlin, partly because I like it but also for performance. The "best next word" function potentially requires comparing every word in the dictionary with every other, making for millions of comparisons. The obvious language for this kind of job is Python, but it generates very slow interpreted code - around 100 times slower than native machine code. The absence of strong compile-time typing also makes it painful to work with, for anything longer than a page or so of code.

Kotlin addresses both of these problems. It has strong typing, nicely done so it mostly doesn't get in the way. And it compiles to JVM code, which supposedly is within a factor of 10 of native machine code. It also has an awful, incomprehensible build system, but that didn't matter for this problem since I wasn't planning on generating a stand-alone program. Working within the IDE was fine.

Another advantage of Kotlin is that it very easily supports multi-threaded operation, taking full advantage of a multi-core CPU. The problem lends itself very easily to a highly concurrent implementation. As is well-known, Python's thread support only allows a single thread to run at a time (the dreaded "Global Interpreter Lock").

My Kotlin program worked, but its performance left a lot to be desired. Running the "best word" algorithm at the start of the game, when the number of possibilities is largest, takes about 40 seconds, and that is running on an 18 core system. The htop command shows that every one of the 18 cores is running at 100% for as long as the command takes to complete.

At that point the program had served its purpose. It was an interesting exercise in writing functional style code in Kotlin. I'd figured out the fairly complex algorithms for comparing and matching words. I had no real intention of using it to play Wordle anyway.

AVX

AVX, and more specifically AVX512, is the latest stage in the development of so-called vector processing for the Intel/AMD X86 family. (That's no longer quite true, because Intel recently announced further extensions called AVX10 - but it will be a while before you can buy a CPU that implements it).

Vector-processing instructions, or SIMD (Single Instruction Multiple Dispatch), have been around for a couple of decades now, originally called SSE. They have gone through several evolutions. AVX512 provides 512-bit registers which can hold 16 32-bit floating point or integer values. A single instruction can do a simultaneous multiply/add with 16 values. With careful coding to keep the pipeline full, one of these can be executed every clock cycle, so about 50 billion multiply/adds per second. And that's per core - a typical server CPU now has 24 cores, so around 1 trillion operations per second.

AVX was conceived for arithmetic operations, especially vector/matrix/tensor multiplication. You don't need to know anything about AVX to use it for this - there are excellent libraries, and even without them the compiler can often figure out how to optimise "normal" code to take advantage of AVX.

Over time, there have been various additions and tweaks to the available instructions to support a wider range of applications. One of the most impressive is simdjson, which uses AVX to accelerate parsing JSON text by a factor of 5-10. It is far from obvious how to do this, even when you have read the paper describing it!

In recent years, matrix and tensor processing has become much more important because of Artifical Intelligence. Neural networks, at the heart of AI, involve huge amounts of matrix arithmetic. For a while the preferred approach was to use GPUs, which are essentially matrix multiplication engines. But with AVX, you can get comparable performance from a CPU, and without the overhead of moving vast quantities of data to and from the GPU.

Unless you code in assembler, you don't get to write individual AVX instructions. Rather you use intrinsics, built-in functions defined by Intel but implemented on GCC and other compilers. Some of these map directly to specific AVX instructions, while others are rearranged and optimised. As a C++ (or whatever) programmer, you don't need to know. But if you're interested, you can always look at the assembler output (-S option) from the compiler, or with the debugger.

The 128, 256 and 512 bit AVX registers don't correspond to any normal data type, so special types are defined for them like __m256 (256 bits representing 8 32-bit floats) and __m512i (representing 16 32 bit integers). The compiler figures out how to use corresponding registers or to map them to memory.

It was AI that led me to take a close look at AVX, for professional reasons. And that got me wondering whether there was a way to speed up the key algorithms for Wordle.

Wordle Algorithms

There are two key algorithms in any program designed to work with Wordle. I'll use the terms target word for the unknown word you're trying to find, and test word for the word you type, looking to see how closely it matches the target.

The first algorithm compares a test word with the target word, finding which letters are exact matches, the right letter in the right place (colored green in the Wordle game) and which are partial matches, the right letter but in the wrong place (colored orange in the game). This is harder than it looks, mainly because of words containing repeated letters. For example if you compare "every" with the target "trade", one "e" is orange and the other black: every. If you compare "poppy" with "plump", you get one "p" of each color: poppy.

The second algorithm - not needed by the game itself but required by any kind of helper app - is to take a word and a match pattern (which letters are green or orange), and determine whether it could apply to a given target word. Once again this is made harder by repeated letters. The result every matches words containing a single "e", but not words with two or more. Whereas every matches words with two "e"s or more, but not those with only one.

Another algorithm is needed to select the best word to refine the set of possible words. The goal is to divide the remaining words into the largest number of distinct subsets, of more or less equal size. For example, suppose that there are 16 remaining words. The ideal test word would create 16 sets each of one word, so there is only one possible choice. That's unlikely, but four sets of four words each is much better than three sets of one word and one containing the remaining 13.

There are no doubt many ways to do this, but the one I chose was to use the concept of entropy taken from information theory. The entropy of a collection of values is given by the formula:

where p_i is the probability associated with this value, such that the sum of all the p_i values is always 1.

In our case, we count the number of words associated with each distinct value of the match result. Then,

where n_i is just the number of words associated with a given result.

So to find the best word, we take each word in the dictionary as a test word, and calculate the entropy when it is applied to the remaining set of words after the preceding test words. The one with the highest entropy is the best.

Combining AVX with the Algorithms

My starting point for using AVX was to represent a word as a list of bitmaps, one for each letter. A letter map is a 26-bit map (represented in a 32-bit integer) with one bit for each letter. When representing a specific word, it's a one-hot map, i.e. only a single bit is set. But it can also represent a set of possible letters.

A word map is a list of letter maps, one for each of the five letters in a Wordle world. Conveniently, this fits into a single AVX 256-bit register, meaning operations can be performed on a whole word in a single instruction.

It's obvious that the exactly-matching letters between two words can be found simply by taking the logical-and of their respective word maps. There will be a bit set in each letter position where there is an exact match, and nowhere else.

This needs to be turned into a 5-bit bitmap. Fortunately AVX512 contains the perfect instruction for this. Each 32-bit integer can be independently compared with a value, setting the corresponding bit in an 8-bit mask if the condition is satisfied. So in our case, we set up a register with zero in every position, and compare for greater. In two AVX instructions - the logical-and followed by the compare-to-mask - we can generate the bitmap for the exact match.

Sometimes we need to count the number of matches. There is another (non-AVX) intrinsic for doing this, __builtin_popcnt which counts the number of set bits in a word. So to count the number of matching letters takes just three instructions:

logical-and of the two word maps
compare-to-mask to generate a mask word with a 1 for every matching position
__builtin_popcnt to count the 1-bits

Finding the partial matches would be easy - if there were never any repeated letters. We create a word map with every letter in the target at every position, except where an exact match has already been found. Then we do the logical-and and compare, as above. Every letter that us a partial match will be a 1.

This fails for repeated letters though, whether in the target or the test word (though differently). To fix this, we have to keep separate letter and word maps for twice- and thrice-repeated letters. The exact algorithm is too complex to explain here (but can be seen in the code).

This done, the complete match() function amounts to about 20 instructions, mostly AVX.

The conforms() Algorithm for AVX

Like match(), conforms() would be very simple if no words contained repeated letters. It can be done as follows:

Ensure that the target contains no letters that failed to match in the test word. This is a single logical-and between the respective letter masks.
Ensure that the target contains all letters that have matched (partial or exact) in the test word. This is also a logical-and of the letter masks.
Ensure that any exact matches are indeed matched. This is a logical-and between the word masks.
Ensure there are no unintended exact matches betwen the two words, again a logical-and followed by a test for zero.

However this is complicated by repeated letters, which may result in both false positives and false negatives. Each repeated letter has to be dealt with separately - there can be at most two of them in any valid English five-letter word. For each letter, we create a map of where to look for it (excluding exact matches) and a count for how many we expect to find. There are two cases of that:

The match result contains no failed matches for the letter, as described above. In that case it's OK if the number of matching letters is greater than the count.
The match result does contain failed matches. In that case the number of matching letters must be exactly the same.

Normally, a match result is tested against a lot of other words - the set of all words that have matched so far. At the beginning of the game this is all words in the vocabulary. So it makes sense to generate an optimised match_target object that contains all of the necessary letter maps and word maps as well as information for repeated letters. Then every test is just a bunch of logical-ands and related instructions, mostly AVX.

Entropy for AVX

The entropy calculation is fairly complicated, and is performed for every possible target word during the "find best word" operation. So it is worth optimising.

The naive algorithm described above requires pre-computing the total of all values, then performing a division for every value. These can be avoided with a bit of algebra, resulting in the following formula:

This looks complicated, but it's very easy to compute efficiently. A single loop computes the first term, and the sum of all values (Σn). A single calculation at the end makes the adjustments.

There's a very nice header-only library called avx_mathfun which provides the usual math functions implemented with AVX, so this operation can be done 16 at a time.

There is one complication: taking the log of zero returns the value nan (not a number). But it's easy to replace all zeroes with 1 before taking the log. The resulting product is zero, as it should be.

The algorithm works on 16 values at a time, in parallel. At the end, the 16 partial sums are all in the same 512-bit register, and must be added together. Perhaps surprisingly, AVX doesn't provide a "horizontal add" instruction, not even as an intrinsic. It has to be done by first adding pairs taken from 16 to produce 8 partial sums, then again to produce 4, then 2, and finally a single value.

Results and Performance

Once I'd figured out the key algorithms, it was quick enough to implement them and to verify their correctness. Rather more time was spent putting the whole program together with a command line interface and all the rest. I included code for measuring the performance of the individual key algorithms.

The result was a pleasant surprise. The original Kotlin code took about 700 core-seconds to find the best next word, at the start of the game. My new AVX code does it in 770 mS, or nearly 1000 times faster. Most of this time is spent in match(), which takes about 23 nS to produce the match result for two words, i.e. about 43 million per second. That is just running on a single core - with this performance it isn't worth the complication of running on multiple cores.

I'm running on a 3 Ghz, 18-core Intel i9, though this program only uses one core.

The AVX entropy() function takes about 570 nS to compute a single entropy result. This is about five times faster than the naive implementation doing each calculation separately. At first I wondered why this wasn't better - after all it is doing 16 operations in parallel. But the naive implementation can skip zero values, whereas the AVX implementation can't unless a whole 16-value AVX register is all zero, which is unlikely.

This could be improved, by considering only the possible match results. There are actually only three outcomes per letter - exact, partial, or none. So the number of useful match results is 243 (3⁵) rather than 1024. This would improve performance by a factor of 4 or so, but the entropy calculation is less than 1% of the total work, so it isn't worthwhile.

Conclusions

You can do cool stuff with AVX and go a lot faster. That isn't really a surprise. But it isn't easy, once outside the classic mathematical workload.

As a programming discipline, it is very different from traditional one-thing-at-a-time computing. You can't do A, then depending on the outcome do either B or C, at least not in a general way. In that respect it is like programming a quantum computer - not that anyone will be doing much of that for a long while yet. There also, you have to carry all the results in parallel, without asking them to interfere with each other.

If you want to take a look at actual code, it is here. The interesting stuff is in the files wordle_word.cpp and entropy.cpp.

Update, 2024-11-20

Sutom and AVX512 Performance

There's a French variant of Wordle called Sutom, published in Le Monde. It uses longer words, of varied length up to about 10 letters. The first letter is given as a clue, and guesses must repeat any previously-guessed letters. I thought it would be good to adapt my program to support that, too.

Implementing the slightly changed rules is straightforward. Supporting words longer than 8 means using a 512-bit (16 x 32) representation for the word. That in turn means using the 512-bit AVX instructions instead of the 256 bit ones. There is a performance penalty for these instructions, at least on Intel - not just the instructions themselves but the whole CPU clock gets slowed down, so all instructions are affected.

I reworked the code to hide all the AVX intrinsics behind an avx:: namespace, using function overloading so the caller doesn't need to know whether they're passing an _mm256i or an _mm512i. That makes it very easy to confine the distinction to a single typedef, so it's easy to build a 256 bit or 512 bit variant without modifying any executable code.

In the event though, the performance impact is small. Using 5-letter words, with a 512-bit representation, is about 15% slower for the most peformance intensive functions. This is unnoticeable for a human user.

The VPCONFLICT Instruction

The documentation for AVX instructions and intrinsics is very poor. Individual functions are documented, though details like parameter order often have to deduced by trial and error. But without reading the hundreds of pages one by one, there's no overall summary of the kinds of things you can do. To make it worse, the different word lengths aren't orthogonal. There are 512-bit instructions that don't exist for 256-bit values, and vice versa. Where the same function is available, they are often still subtly different.

So it was quite by chance that I discovered the VPCONFLICT instruction and its associated intrinsics _mmnnn_conflict_epi32(). This compares every 32-bit word with every other, finding identical values. I have no idea what it's for as far as vector arithmetic is concerned, but this corresponds exactly to the problem of finding duplicate letters in a word.

One wrinkle is that it doesn't tell you about each repeat. Rather, it tags each value with how many previous values it conflicts with. So the word abbccc will return {0,0,1,0,1,2}. This is useful, but since my algorithm requires, for example, a mask containing all letters that have been repeated exactly once, it takes some craftiness to use the value returned by the instruction.

The challenge was to write the code as just a linear sequence of AVX instructions, without any branches or looping. Reinstating the first and second occurrences of repeated letters is done by:

use the conflict() intrinsic to find conflicts
create a _mmask value representing all non-zero values
use that to create a single letter mask of repeated letters, with a horizontal OR
replicate that to every position
compare the original word mask to this
anywhere that matches is a repeated letter

I wrote this code just for fun. It's only used when loading the dictionary, which only takes a few milliseconds when the program starts up. Performance is about 25% better than the obvious approach of explicitly comparing the letters with each other and looping over the result. But it was fun.

AVX2 Compatibility

There's a machine I sometimes have to use that only has AVX2, i.e. 256-bit instructions. It would be nice to be able to run the program on that.

The compare_mask instructions are only available in AVX512, even for 256-bit values. So these have to be replaced with explicit calculation of the mask, by looking at each 32-bit word in turn. This is also the case for the mask_blend instructions, that take a bitwise mask and use it to select which value to use from two inputs. So that too has to be emulated.

The conflict intrinsics, described above, are also not available, but I already had the code to do this.

Overall performance for the AVX2-only code is about 25% worse than for AVX512, which doesn't matter in practice. That would be a bit worse if 512-bit instructions were emulated using two 256-bit instructions, but I doubt it would be noticeable to a human user.

Thoughts from upside down

Wednesday, 9 August 2023

Fun with AVX and Wordle