Sunday 11 December 2011

Algorithm Design: Efficient LDPC Encoding (Part 4: Optimization)

In Part 3, I described the implementation of the algorithms for reducing an LDPC code to an encodable form. At that point, the algorithms themselves were as efficient as they could be; the only remaining performance gains would come from improving the code. This is rarely worth spending too much time on, but in this case the overall performance is completely dominated by two inner loops. One iterates through a sparse representation of a row, adding it to a dense row. The other iterates along the words of one dense representation, adding them to another. Halving the time spent in both of these loops - just a few instructions each - will halve the execution time of the whole algorithm. So it's worth taking a close look.

Let's start with the add-sparse-to-dense loop. The original code used conventional STL iterators to scan through the elements of the sparse row, then for each element, converted it to the offset-and-mask combination for the particular bit number, and applied it using an xor operation. It's the obvious way. But each sparse row is added to a dense row tens of thousands of times, so it's worth considering whether any part of this operation can be amortized.

The final solution was to pre-calculate a vector containing the offset-and-mask for each entry. Each offset-and-mask pair was represented as a small class called "bitref". In the source code, this results in a vector<bitref>, which is iterated through in the usual way. The compiler is nevertheless clever enough to inline all this and reduce the inner loop to just four machine instructions: two to extract the offset and mask, one to perform the xor operation, and one to advance to the next entry. Not bad. Performance was improved substantially, reducing the time for phase 2 of the algorithm by a factor of about three.
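In outline, the idea looks something like this - an illustrative sketch rather than the actual source, assuming 32-bit words and made-up names:

#include <cstdint>
#include <vector>

typedef std::vector<uint32_t> dense_row;    // one bit per column, packed into 32-bit words

// Pre-calculated reference to one bit: word offset plus single-bit mask.
class bitref
{
public:
    explicit bitref(std::size_t bit) : offset(bit >> 5), mask(1u << (bit & 31)) { }
    void flip_in(dense_row &row) const { row[offset] ^= mask; }
private:
    std::size_t offset;
    uint32_t    mask;
};

// Built once per sparse row, then reused every time that row is added.
typedef std::vector<bitref> sparse_row_bits;

void add_sparse_to_dense(const sparse_row_bits &sparse, dense_row &dense)
{
    for (std::vector<bitref>::const_iterator it = sparse.begin(); it != sparse.end(); ++it)
        it->flip_in(dense);    // extract offset and mask, xor, advance
}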

There remains the overhead of the loop. Given the tiny size of the content, the two instructions of loop overhead, and their impact on pipelining, are worth worrying about. In assembler, the obvious way to unroll the loop would be to lay out sequential instructions corresponding to its maximum size, then jump into these at the appropriate point corresponding to the loop size (i.e. the number of entries in the vector). This is difficult to do in C++ though.

In the end I came up with something similar, but with a distinct unrolled loop for each possible size. This was done as my first tiny adventure in C++ template metaprogramming, and is described in the post "IRP rediscovered - first steps in Template Metaprogramming" below. The compiler is smart enough to translate the switch statement based on vector size into an indexed jump, which then executes the entire loop in a straight line. That gave me about another 5% improvement.

Having got the inner loop as tight as possible, it was time to think about the next layer out. Gcc does a good job of inlining functions when it's the right thing to do, but examination of the assembler output (-S option) showed that it was not inlining a couple of critical functions here. I played around with the compiler parameters that control inlining for a while and things got a little better, but I just could not convince it to inline one critical function. Of course the "nuclear option" of making it a macro always exists, but I really wanted to avoid that. I tried the "flatten" function attribute for the outer loop, which tells the compiler to inline absolutely everything, but after the compiler had run for half an hour or so I stopped it. I think it got put off by all the calls to boost::format that I use in my debug log macros.
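For reference, the two attributes look like this - a sketch with made-up function names, using gcc-specific syntax:

struct sparse_row;    // stand-ins for the real row types
struct dense_row;

// always_inline forces a particular callee to be inlined even when the
// compiler's heuristics would otherwise refuse.
__attribute__((always_inline))
inline void add_sparse_to_dense(const sparse_row &s, dense_row &d);

// flatten asks gcc to inline, where it can, every call made inside the
// annotated function's body - in effect, "inline absolutely everything".
__attribute__((flatten))
void reduce_gap_rows(dense_row &row);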

Eventually, I found a minor rearrangement of functions that got everything inlined. That gave me another 5% or so performance improvement.

That dealt with the inner loop of phase 2, adding sparse rows to dense rows. In phase 3, the inner loop is adding dense rows to dense rows. Unrolling this loop was easier since it is always over the whole length of a dense row - over 64K bits, or 2K operations. There's nothing to be gained by completely unrolling such a loop. Instead I changed the code to do it in "gulps" of 16 entries at a time, then used a normal loop to deal with the remainder at the end. I also rearranged things here so that the call to the inner loop was fully inlined.
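Schematically it looks something like this - a sketch assuming 32-bit words, with a fixed-count inner loop that the compiler unrolls trivially:

#include <cstdint>
#include <vector>

typedef std::vector<uint32_t> dense_row;

// Add (xor) src into dst, sixteen words per gulp, with an ordinary loop
// mopping up whatever is left at the end.
void add_dense_to_dense(const dense_row &src, dense_row &dst)
{
    std::size_t n = dst.size();
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)
        for (std::size_t j = 0; j < 16; ++j)    // fixed trip count of 16
            dst[i + j] ^= src[i + j];
    for (; i < n; ++i)                          // the remainder, one word at a time
        dst[i] ^= src[i];
}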

And that is about as far as things can be taken. The original C++ code took about 400 seconds for a column-weight 3, 32K data bit code. The final code takes under 7 seconds. I never ran a column-weight 5 code to completion with the original code - it would certainly have taken thousands of seconds, maybe much more. But now, it runs in about 45 seconds.

Of course there's a price to pay for all this. One of the first principles of writing maintainable systems is never to represent the same information in more than one way. This code violates that all over the place - for example, the sparse and dense representations of rows. But without this kind of approach, the code would be unusable anyway, so its maintainability wouldn't matter much. It has certainly been one of the most interesting bits of programming I've undertaken in a long time.

Thursday 8 December 2011

Algorithm Design: Efficient LDPC Encoding (Part 3: Implementation)

In Part 2 I described the algorithms that need to be implemented in order to transform the base matrix of an LDPC code into a reduced form that can be used by a practical encoder. As I mentioned in Part 1, we originally built a very straightforward Python implementation, where the matrix was represented literally as a bunch of rows of 0s (mostly) and 1s (rarely). Extrapolating from its performance with toy-sized codes, it would have taken months or years to reduce a life-sized (>32K bits) code. We needed something a bit faster, like a few seconds, so I set out on a C++ implementation. C++ is naturally 20-50 times faster than Python, but we would need a lot more than that.

The first, obvious, step was to change the representation to a sparse array, where only the 1 values are held explicitly. The Python code spent most of its time searching arrays of 0s trying to find the occasional 1, adding a further factor of O(n) to its execution time.

During the first phase, all the work consists of swapping rows and columns. To support this efficiently, the sparse array consists of "bitnodes" representing a 1. They are linked into lists both for the row and for the column, and contain pointers back to each of these. This means that when rows are swapped, the columns get to find out about it with no further work, and vice versa. The implementation makes extensive use of the Boost intrusive library, about which I've already eulogized. In the original implementation, the row and column lists were held in order, though I ended up rethinking this later. Here is the structure of a bitnode:

#include <boost/intrusive/set_hook.hpp>
#include <boost/intrusive/list_hook.hpp>

namespace bi = boost::intrusive;

class row_t;    // the row and column structures, defined elsewhere
class col_t;

class bitnode
{
private:
    typedef bi::set_member_hook<bi::link_mode<bi::auto_unlink> > row_hook_t;
    typedef bi::list_member_hook<bi::link_mode<bi::auto_unlink> > col_hook_t;
    row_hook_t row_hook;    // links this node into its row
    col_hook_t col_hook;    // links this node into its column
    row_t *my_row;          // backpointer to the owning row
    col_t *my_col;          // backpointer to the owning column
public:

    // member functions follow
    // ...
};

Note the use of the intrusive member hooks, which allow the same structure to be linked into several lists (or sets). The backpointers to the row and column allow the row and column numbers to be tracked as they are swapped, which would not be the case if they were held explicitly.

This basic implementation worked well for codes with a column weight of 3, taking about 300 seconds to transform a 32K bit code. For a column weight of 5, though, which results in a much larger gap, it was unusable.

A little instrumentation showed that all the time was spent adding rows together. In the set-based implementation of sparse rows, every addition involved either the creation or the deletion of a node in a tree, a relatively expensive operation. The solution was to switch to a dense representation for the gap rows only. So, just before starting phase 2 (elimination of the ones in the F region of the matrix), the gap rows are converted to a dense representation, with one bit per possible position. This is simple enough in theory but took a lot of reworking of other structures, such as the columns. It was worth it, though: the time dropped to around 60 seconds for the column weight 3 codes, and to around 300 seconds for the column weight 5 ones.
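The conversion itself is straightforward - something like this sketch, which takes a plain vector of column positions rather than the intrusive lists of the real code:

#include <cstdint>
#include <vector>

typedef std::vector<uint32_t> dense_row;    // one bit per possible column position

dense_row make_dense(const std::vector<std::size_t> &one_positions, std::size_t n_cols)
{
    dense_row dense((n_cols + 31) / 32, 0);
    for (std::size_t i = 0; i < one_positions.size(); ++i) {
        std::size_t bit = one_positions[i];
        dense[bit >> 5] |= 1u << (bit & 31);    // set the bit for this 1 entry
    }
    return dense;
}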

Adding a sparse row to a dense row means walking the bitnodes in the sparse row and xor'ing the corresponding bit. Adding a dense row is just a tight loop xor'ing the 32-bit words together, an O(n) operation. These two inner loops are the key to performance - we'll come back to them later.

As always, when you speed up one part, you find another bottleneck. In this case it was phase 1 again. The best column to swap when making the diagonal was selected by simply scanning the columns linearly, which is obviously expensive. The solution was to keep the columns constantly sorted by suitability - in effect a priority queue, implemented yet again as a boost intrusive set. However the ordering changes constantly - when a row has been incorporated into the lower triangle, the columns it contains now have one less 1 in the region of interest. Increasing the gap also affects it. Fortunately, the row structure makes it easy to update just the columns that are directly affected, which is O(b), and then to correct their position in the set. Hence the total operation each time is O(b log(n)), which is much better than the O(n) scan used previously.
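A sketch of the idea - much simplified, with illustrative field names; the real column structure carries more state:

#include <boost/intrusive/set.hpp>
#include <boost/intrusive/set_hook.hpp>

namespace bi = boost::intrusive;

struct column
{
    bi::set_member_hook<> hook;
    unsigned ones_in_region;    // 1s above the current diagonal row
    unsigned index;             // tie-breaker, so the ordering is strict

    bool operator<(const column &o) const
    {
        return ones_in_region < o.ones_in_region
            || (ones_in_region == o.ones_in_region && index < o.index);
    }
};

typedef bi::set<column,
                bi::member_hook<column, bi::set_member_hook<>, &column::hook> >
    column_queue;

// When a row is incorporated into the lower triangle, only the b columns
// it touches change their counts: unlink each one, adjust it, re-insert it.
// Each update is O(log n), so the whole step is O(b log(n)).
void note_one_eliminated(column_queue &queue, column &c)
{
    queue.erase(queue.iterator_to(c));
    --c.ones_in_region;
    queue.insert(c);
}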

For a column weight of 3, this made phase 1 practically disappear as a performance concern, as I expected. But for a column weight of 5, it was still taking the majority of the time - which I didn't expect. Further analysis showed that keeping the columns in order was very expensive. Every time a row was moved to the gap, every column had to be re-sorted. On further thought, there is only one time when it helps for a column to be sorted, which is when it is being processed as the diagonal element. So just sorting it there, once per column, would work just as well and remove an O(n) element from the algorithm. With this change, phase 1 moved down into the noise - for a column weight 3 code, it is about 4% of the total time.

At this point there are no further fundamental improvements to be made - the order of work to be done for each phase cannot be reduced. Further improvement can only come by coding optimizations, which will be discussed in Part 4.

Tuesday 6 December 2011

Algorithm Design: Efficient LDPC Encoding (Part 2: Algorithms)

In Part 1, I described the problem we are trying to solve, taking a sparse matrix and solving the corresponding system of simultaneous equations (around 33000 of them) so that we can build an efficient hardware encoder for Low Density Parity Check (LDPC) codes.

Efficient encoding requires that the original sparse matrix be transformed such that all the encoder has to do is calculate a number of parity checks. Most of these are very sparse, so they can use shared hardware. A small proportion (about 3% in a typical code) are dense, i.e. they have about the same number of 1s and 0s, and so cannot share hardware.

The resulting transformed matrix is called the "reduced" matrix, and when it is complete it has the following form:

+------------------------+--------+------------------------+
|            D           |    E   |           F            |
+------------------------+--------+------------------------+
|            A           |    B   |           C            |
+------------------------+--------+------------------------+

Rows in the D/E/F part are called the "gap" in the literature. Initially the reduced matrix is set to be identical to the base matrix, and the gap is empty. In a matrix representing a system of simultaneous equations, such as this, rows can be swapped without changing the meaning, as can columns. Also, rows can be added together (although columns cannot be). In binary addition, 1+1=0. We use these facts to rearrange the matrix into reduced form, by the following steps.

1. Transform part C into "lower triangular" form (LTF), in which everything above the main diagonal is zero. This can be done by swapping rows and columns. At each step, we look for a column that has just a single entry above the current diagonal row, then swap it with the current diagonal column. Finding a suitable column is the key to performance at this step.

2. Sometimes, we can't find such a column. This is how the gap gets created. We choose a column with the smallest number of such entries and swap that. Then we exchange rows so that the populated rows move into the gap area.

3. When this part is finished, C is in lower triangular form, but the gap is not. The next task is to do the same for the gap, by emptying F altogether and getting E into lower triangular form. So far, all rows are still sparse, since no row or column has been changed apart from ordering.

4. For each row in F, we eliminate all bits using Gaussian elimination. Starting with the rightmost 1 bit, we add the corresponding row from C which has this as its rightmost bit (i.e. on the diagonal). We repeat this, moving leftward, until the F part of each row has been emptied (a sketch of this step follows the list). In the process, the rest of the row becomes dense, with on average as many 1s as 0s.

5. We now have F empty, and we need to transform E into lower triangular form. We do this by Gaussian elimination again, this time using rows from the gap. We start from the bottom and work up, creating the diagonal as we go, so that we don't put back bits that we have already eliminated.

6. Now we're done. E and C between them have a neat diagonal line with nothing above it. F is empty. A and B are still sparse, but D and the lower triangle of E are dense. All the bits in columns in A and D are data bits. The check bits, in the remainder of the matrix, are generated from these.
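To make step 4 concrete, here is a rough sketch of emptying the F part of one gap row - illustrative types and helpers only; the real data structures are described in Part 3:

#include <cstdint>
#include <vector>

typedef std::vector<uint32_t>    dense_row;    // one bit per column
typedef std::vector<std::size_t> sparse_row;   // column indices of the 1s

inline bool test_bit(const dense_row &r, std::size_t bit)
{
    return (r[bit >> 5] >> (bit & 31)) & 1u;
}

inline void add_row(dense_row &dst, const sparse_row &src)    // binary addition is xor
{
    for (std::size_t k = 0; k < src.size(); ++k)
        dst[src[k] >> 5] ^= 1u << (src[k] & 31);
}

// diag[j] is the row of C whose rightmost 1 (its diagonal element) sits in
// column j. Scanning the F columns from right to left and adding diag[j]
// whenever bit j is set clears the whole F part, because each addition
// only disturbs columns to the left of j.
void eliminate_f(dense_row &gap_row, const std::vector<sparse_row> &diag,
                 std::size_t first_f_col, std::size_t last_f_col)
{
    for (std::size_t j = last_f_col + 1; j-- > first_f_col; )
        if (test_bit(gap_row, j))
            add_row(gap_row, diag[j]);
}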

Let's take a look at the performance of each of these steps. First we need to define some terms:

b: the number of 1 bits in a single row in the base matrix. This is small, and independent of the size of the code. It's also referred to as the row weight. We refer to the number of 1s in a single column as the column weight.

g: the number of rows in the gap region (D/E/F). Although this is much smaller than the number of data bits, it is directly linked to it. For a column weight of 3, it is about 3.3% of it. Hence anything which is O(g) is also O(n), though with a much smaller actual value.

n: the total number of rows.

The task falls into three phases:

-- Phase 1: rearrangement of rows and columns to create the C region. This has to be done once for each row (less the gap rows), and each time, we have to select the best available row. If we simply scan the rows looking for the best one, this will be O(n), making the overall task O(n^2). We'll explain later how this can be made O(n log(n)). In addition we have to create the gap. Rippling a row up into the gap is O(n), and has to be done for each gap row, so the total task is O(n*g). In principle this is O(n^2), but because g is so much smaller than n, with suitable design it can be kept small, comparable with the O(n log(n)) time of the row rearrangement.

-- Phase 2: eliminating all the 1s in the F region. There are g rows to deal with, and once the process starts they quickly become dense. Hence O(n*g) row additions are required, where one (the gap row) is dense, and the other, coming from the C region, is sparse. The amount of work per addition is O(b), making the whole task O(n*g*b).

-- Phase 3: eliminating the upper half of the triangle in the E region. There are g rows, and O(g) bits to be eliminated in each row, so there are O(g^2) additions. Since these involve adding dense rows to each other, the amount of work per addition is O(n), making the whole task O(n*g^2) - or in other words, O(n^3). For small codes, this phase is dominated by phase 2, but as the code size increases it starts to dominate the total time. This is especially true if larger row or column weights are used, since the gap becomes proportionately larger (about 10% of the total rows for a column weight of 5).

These are the fundamental limits of the algorithm - no matter how clever the design, the three phases will have complexity O(n log(n)), O(n*g*b) and O(n*g^2) respectively. The trick of a good implementation is to achieve these limits, and to minimize the actual values in each case. Part 3 discusses how this was done.

Algorithm Design: Efficient LDPC Encoding (Part 1: Background)

I've been working lately on a system design which requires, among other things, highly effective and efficient error-correcting codes (ECC). We've decided to use a Low Density Parity Check (LDPC) code. These are currently considered to be the best "soft" ECCs, i.e. where there is information about the reliability of each received bit as well as its putative value. The story behind LDPCs is interesting: they were invented by Robert Gallager in his PhD thesis in 1960, but they were way beyond contemporary computing power. It didn't help that when he wrote the definitive textbook on ECCs in 1966, he didn't mention them! So they languished, forgotten, until a decade ago. By then TurboCodes had been independently invented. They also provided a means for "near Shannon limit coding", i.e. extracting as much data from a noisy signal as theoretically possible.

LDPCs have two properties which led to the problem I needed to solve. First, there is no formula that provides the best code for a given set of constraints (block size and code rate). You can use the same general scheme to build ten different codes, the details being decided by a random number generator, and some will be significantly better than others. That means that to find the code you want to use in practice, you need to generate a whole bunch of them and try them out over a large number of messages and error densities.

That leads to the second problem. An LDPC starts out as a very sparse matrix, describing a large number of parity checks each of which covers a small number of bits - hence the name. We want to have 32768 bits of user data, and a reasonable configuration is to have each bit covered by three checks. If we use a half-rate code (same number of data bits and check bits) then each check covers six bits. So we have a matrix where each row is 64K bits long and has just six 1 bits.

The matrix doesn't say anything about which bits are data and which are check bits, only that a valid codeword has to satisfy all the checks. So given 32K data bits, the way to generate the corresponding 32K check bits is to solve the 32K simultaneous equations that the sparse matrix implicitly describes. Easy!

Well, no, not easy at all. The practical use of LDPCs requires a transformation of the matrix into something that normal hardware or software can encode in a linear and reasonable amount of time. Solving the equations directly is an O(n^3) problem, i.e. the time required increases with the cube of the number of unknowns. So we have to do some preprocessing on the matrix to get it into a form that the hardware can work with. There's an excellent paper by Qi and Goertz describing how to go about this. The algorithm it describes is, not surprisingly, also O(n^3). This needs to be run for every trial code, and we would like to try hundreds of them.

Our first attempt at coding the algorithm was written in Python, using the obvious data representation, i.e. a big matrix containing mostly 0s and a few 1s. It was written so we could understand the algorithms and piece together a complete system, rather than for performance. On a "toy" code of a few hundred bits, it took a couple of minutes to run. On slightly larger codes - nowhere near the size we need for our system - it took most of the day. By extrapolation, to generate a life-size code would have taken months or years.

Clearly, we needed an implementation more focused on performance - not just code optimization, but selecting algorithms to minimize the time at each step of the algorithm. And that is where it begins to get interesting. More on that in Part 2.

Thursday 1 December 2011

IRP rediscovered - first steps in Template Metaprogramming

One of the nice things about the PDP-11 assembler was its powerful macro features. Not only could you do basic text substitution, you could create loops using the REPT directive, for a fixed number of iterations, or IRP, which iterated over a list of arguments. It was especially good for setting up data structures, which nowadays would be viewed as a rather crude application specific language (ASL). (Before I start getting hate-mail, yes, I know this was originally from the PDP-10).

For whatever reason, the designers of C eschewed all this and just went for simple text substitution. Every now and then I have a bout of nostalgia for the PDP-11 assembler, especially when trying to build elaborate descriptive data structures. Of course there's always M4, but the learning curve is huge. Actually I'm a long way down the forgetting curve for M4: a long while back I built a very elaborate set of macros for tracking register usage and many other things for some MIPS assembler that I wrote. But that was a long time ago.

Then just the other day I really needed the old REPT directive. I've been working on a very interesting algorithm design problem, for reducing low-density parity check codes (LDPC) to a form where they can be encoded by practical hardware. The innermost loops of this algorithm are extremely performance critical - by nature this is an O(n^3) problem (i.e. the complexity increases with the cube of the size of the code). For a realistic sized code of say 32K data bits, the innermost part of the algorithm gets executed several billion times. Normally I'm content to let the compiler worry about the details of code optimization - today's compilers (gcc and MSVC) do a wonderful job. But in this case, saving a single instruction could cut seconds off the execution time, so it was worth digging a bit deeper.

Of course the first part of optimization is to use the right algorithms and data structures. I'd already done all that, cutting the execution time by a factor of thousands compared to our initial, simple implementation. Now I was looking to shave off another factor of two by paying attention to the details.

One such detail was to unroll the critical inner loops, replacing them with linear sequences of instructions with no tests or jumps. After some careful crafting of data structures, the loops were extremely tight - less than ten instructions. One of the loops has a large repeat count, so it was easy just to do it in gulps of 16 at a time. At that level the loop overhead is negligible, and when the remaining number is less than 16, the last few can be done one at a time.

The other loop was trickier though. The number of iterations is small, in the range 6-20, so the whole loop has to be done at once. A quick experiment showed that gcc implements a switch statement using a jump table, so it would be quick to dispatch to the right unrolled loop. But how to generate the code without tediously repeating the same statements over and over?

That was when I thought of using metaprogramming, i.e. programs that run at compile time rather than at execution. The idea is to declare a template class, parameterized by an integer that tells it how many instances you want. The resulting code looks like this:

#include <vector>
using std::vector;

// "operation" and "operand" are application classes, defined elsewhere.

template<int I> struct repeat
{
    void apply(vector<operation> &ops, vector<operand> &v)
    {
        ops[I-1].execute(v);           // perform this iteration's operation
        repeat<I-1>().apply(ops, v);   // then recurse for the remaining I-1
    }
};

// Explicit specialization: stops the recursion when the count reaches zero.
template<> void repeat<0>::apply(vector<operation> &, vector<operand> &) { }


The details of what's being done aren't too important here. "ops" is a vector of operations, each of which says what to do and which operand vector element to apply it to. We want to make sure that every operation in the vector is applied.

The "apply" function first does the operation corresponding to its parameter, then recursively invokes the class with a parameter of one less. But how to get the recursion to stop? This is where the specialized function declaration comes in. The compiler will always choose an explicit specialization over the generic definition, so when the parameter reaches zero, this empty function is selected and the recursion stops.

The code that uses the class looks like this:

switch (ops.size()) {
case 6:
    repeat<6>().apply(ops, v);
    break;
// ... cases 7 through 19 follow the same pattern ...
case 20:
    repeat<20>().apply(ops, v);
    break;
default:
    for (auto opi = ops.begin(); opi != ops.end(); ++opi) {
        opi->execute(v);
    }
    break;
}


I happen to know that the vector size will normally be in the range 6-20. The default is there so the code will work, albeit less efficiently, if it isn't. If you really had no idea of the limits, you would first deal with chunks of say 16 at a time, then finish off the remainder using the above technique.

It looks as though this will produce horrific code, with the recursion and everything else. If you compile without optimization, for debugging, indeed it does, with a deep nest of function calls, each with its own call, entry and exit sequences. But if you turn on full optimization, gcc produces exactly what you would write if you hand-coded it - just the exact set of instructions required to implement each iteration of the loop. (I imagine MSVC would too, though I haven't tried it). You'll notice that the "repeat" object is instantiated, but since it has no content, this doesn't actually do anything.

To the real experts in metaprogramming (all dozen of them), this is child's play. But for the casual visitor to the topic, like myself, it's a neat technique that can save a lot of tedious and error-prone repetition. As I expected, unrolling this innermost of inner loops saved about 5% of the execution time, which is a useful contribution to my overall target of 50%.

Favourite Restaurants #4: Kaiten Sushi, Shinbashi, Japan

When I first started travelling to Japan, I would generally stay at the Shiba Park Hotel. My business there was at the Japanese national standards body, whose offices were just across the street from the Tokyo Tower and a short and pleasant walk through the Shiba Park itself from the hotel.

In the evening a longer walk - fifteen minutes or so, one subway stop - led to the Shinbashi area. This is a maze of tiny side streets, packed with minuscule restaurants that fill with salarymen (the Japanese word for middle-class office workers) at lunchtime. After work they're back, for a beer or two with their colleagues, and a plate of noodles or sushi before setting out on their long commute to the distant suburbs. It was in one of these, many years ago, that a friend who was learning Japanese managed to order a plate of chicken sashimi (yes, just raw chicken) with a bowl of what tasted like rotten strawberry jam.

Close to the main square at Shinbashi Station, the one with the steam locomotive in it, is a kaiten sushi restaurant. Kaiten - written 回転 in Japanese - just means "turning round". You've probably been to one - instead of ordering from a waiter, you have a conveyor belt in front of you covered in little dishes of sushi. You take whatever you want, and at the end they figure out your bill by counting the plates. This system depends absolutely on having a very high turnover. It takes only a short while, maybe 15 minutes, for the fish to start to dry out and look distinctly unappetising. Kaiten sushi tends to be a lunchtime thing, when there are big crowds in a short time.

An additional benefit of course is that you don't need to be able to speak the language. Assuming you can recognise the things you like - or don't mind taking a risk - you just pick things out as they pass.

A further sophistication of the same idea is to replace the conveyor belt by a canal with little boats carrying the plates of sushi. This was an American invention - Isobune Sushi in San Francisco's Japantown claims to have invented it, though for all I know so does every other boat sushi restaurant in the country. To my great frustration, I have never been able to work out what makes the boats move round the canal.

But back to Shinbashi. We first went to the Kaiten sushi on our first trip together to Japan (though we'd both travelled to Japan before). It's a very unassuming place, full of salarymen during the week and shoppers at the weekend. It's important to go when it's busiest, before about 1.30 - as I explained before. Sometimes that means a bit of a wait, then you squeeze onto two tiny stools (if there are two of you of course - though it's very common for people to eat there on their own), squashed between the other diners. Service is minimal, though courteous and attentive anyway since this is Japan. Every three places or so there's a hot water tap, a pile of cups and a box of teabags (o-cha - green tea - of course), along with a chopstick dispenser, napkins, soy sauce and pickled ginger. You just take what you need and wait for your favourite sushi to roll by. If you want beer or sake, you have to order that.

In the middle of the island, three or four sushi chefs toil continuously, replenishing the dishes. If you watch them carefully you can see what they are making, usually in batches of half a dozen or so dishes, and if it's something you're waiting for, you can prepare to grab it quick. The normal protocol is just to take things from the belt, but if you want something that isn't there or is a bit special, you can ask one of the chefs and they'll make it for you.

When you've had enough, you just stand up and walk to the door. The cashier shouts to the other staff, who counts your dishes and shouts back the price, you pay - usually in cash - and that's it. There's no formality to it and of course, in Japan, no tipping.

For some reason we really took to this place. Every time we go to Japan we manage to squeeze in a visit. It hasn't changed in the 20+ years we've been going there, although I guess the staff must have moved on. Each time we dread that it will have closed - so many of our favourite spots in Tokyo have closed and been replaced by office buildings, like the "Rubbery Pancakes" breakfast spot next to the Shiba Park. But, so far, it has still been there every time.