In Part 3, I described the implementation of the algorithms for reducing an LDPC code to an encodable form. At that point, all the algorithms were the most efficient possible. The only remaining performance gains would come from improving the code. This is rarely worth spending too much time on, but in this case the overall performance is completely dominated by two inner loops. One iterates through a sparse representation of a row, adding it to a dense row. The other iterates along the elements of the dense representation, adding them. Halving the time spent in both of these loops - just a few instructions each - will halve the execution time of the whole algorithm. So it's worth taking a close look.
Let's start with the add-sparse-to-dense loop. The original code used conventional STL iterators to scan through the elements of the sparse row, then for each element, converted it to the offset-and-mask combination for the particular bit number, and applied it using an xor operation. It's the obvious way. But each sparse row is added to a dense row tens of thousands of times, so it's worth considering whether any part of this operation can be amortized.
The final solution was to pre-calculate a vector containing the offset-and-mask for each entry. The latter was actually represented as a class, called "bitref". In the source code, this results in a vector<bitref>, which is iterated through in the usual way. The compiler is nevertheless clever enough to inline all this and reduce the inner loop to just four machine instructions: two to extract the offset and mask, one to perform the xor operation, and one to advance to the next entry. Not bad. Performance was improved substantially, reducing the time for phase 2 of the algorithm by a factor of about three.
There remains the overhead of the loop. Given the tiny size of the content, the two instructions of loop overhead, and their impact on pipelining, are worth worrying about. In assembler, the obvious way to unroll the loop would be to lay out sequential instructions corresponding to its maximum size, then jump into these at the appropriate point corresponding to the loop size (i.e. the number of entries in the vector). This is difficult to do in C++ though.
In the end I came up with something similar, but with a distinct unrolled loop for each possible size. This was done as my first tiny adventure in C++ template metaprogramming, and is described here. The compiler is smart enough to translate the switch statement based on vector size into an indexed jump, which then executes the entire loop in a straight line. That gave me about another 5% improvement.
Having got the inner loop as tight as possible, it was time to think about the next layer of the loop. Gcc does a good job of inlining functions when it's the right thing to do, but examination of the assembler output (-S option) showed that it was not inlining a couple of critical functions here. I played around with the compiler parameters that control inlining for a while and things got a little better, but I could just not convince it to inline one critical function. Of course the "nuclear option" of making it a macro always exists, but I really wanted to avoid that. I tried the "flatten" function attribute for the outer loop, which tells the compiler to inline absolutely everything, but after the compiler had run for half an hour or so I stopped it. I think it got put off by all the calls to boost::format that I use in my debug log macros.
Eventually, I found a minor rearrangement of functions that got everything inlined. That gave me another 5% or so performance improvement.
That dealt with the inner loop of phase 2, adding sparse rows to dense rows. In phase 3, the inner loop is adding dense rows to dense rows. Unrolling this loop was easier since it is always over the whole length of a dense row - over 64K bits, or 2K operations. There's nothing to be gained by completely unrolling such a loop. Instead I changed the code to do it in "gulps" of 16 entries at a time, then used a normal loop to deal with the remainder at the end. I also rearranged things here so that the call to the inner loop was fully inlined.
And that is about as far as things can be taken. The original C++ code took about 400 seconds for a column-weight 3, 32K data bit code. The final code takes under 7 seconds. I never ran a column-weight 5 code to completion with the original code - it would certainly have taken thousands of seconds, maybe much more. But now, it runs in about 45 seconds.
Of course there's a price to pay for all this. One of the first principles of writing maintainable systems is never to keep the same information in more than one way. This code violates that all over the place - for example, the sparse and dense representations of rows. But without this kind of approach, the code would be unusable anyway, so its maintainability wouldn't matter much. It has certainly been one of the most interesting bits of programming I've undertaken in a long timer.