Thursday, 8 December 2011

Algorithm Design: Efficient LDPC Encoding (Part 3: Implementation)

In Part 2 I described the algorithms that need to be implemented in order to transform the base matrix of an LDPC code into a reduced form that can be used by a practical encoder. As I mentioned in Part 1, we originally built a very straightforward Python implementation, where the matrix was represented literally as a bunch of rows of 0s (mostly) and 1s (rarely). Extrapolating from its performance with toy-sized codes, it would have taken months or years to reduce a life-sized (>32K bits) code. We needed something a bit faster, like a few seconds, so I set out on a C++ implementation. C++ is naturally 20-50 times faster than Python, but it would take a lot more than that.

The first, obvious, step was to change the representation to a sparse array, where only the 1 values are held explicitly. The Python code spent most of its searching arrays of 0s trying to find the occasional 1, adding a further O(n) to its execution time.

During the first phase, all the work consists of swapping rows and columns. To support this efficiently, the sparse array consists of "bitnodes" representing a 1. They are linked into lists both for the row and for the column, and contain pointers back to each of these. This means that when rows are swapped, the columns get to find out about it with no further work, and vice versa. The implementation makes extensive use of the Boost intrusive library, about which I've already eulogized. In the original implementation, the row and column lists were held in order, though I ended up rethinking this later. Here is the structure of a bitnode:

   class bitnode
        typedef bi::set_member_hook<bi::link_mode<bi::auto_unlink> > row_hook_t;
        typedef bi::list_member_hook<bi::link_mode<bi::auto_unlink> > col_hook_t;
        row_hook_t row_hook;
        col_hook_t col_hook;
        row_t *my_row;
        col_t *my_col;

        // member functions follow

Note the use of the intrusive member hooks, which allow the same structure to be linked into several lists (or sets). The backpointers to the row and column allow the row and column numbers to be tracked as they are swapped, which would not be the case if they were held explicitly.

This basic implementation worked well for codes with a column weight of 3, taking about 300 seconds to transform a 32K bit code. For a column weight of 5, though, which results in a much larger gap, it was unusable.

A little instrumentation showed that all the time was spent adding rows together. In the set-based implementation of sparse rows, every addition involved either the creation or the deletion of a node in a tree, a relatively expensive operation. The solution was to switch to a dense representation for the gap rows only. So, just before starting phase 2 (elimination of the ones in the F region of the matrix), the gap rows are converted to a dense representation, with one bit per possible position. This is simple enough in theory but took a lot of reworking of other structures, such as the columns. It was worth it, though: the time dropped to around 60 seconds for the column weight 3 codes, and to around 300 seconds for the column weight 5 ones.

Adding a sparse row to a dense row means walking the bitnodes in the sparse row and xor'ing the corresponding bit. Adding a dense row is just a tight loop xor'ing the 32-bit words together, an O(n) operation. These two inner loops are the key to performance - we'll come back to them later.

As always, when you speed up one part, you find another bottleneck. In this case it was phase 1 again. The best column to swap when making the diagonal was selected by simply scanning them linearly, which is obviously expensive. The solution was to keep a constantly-sorted list of the best one - actually a priority queue, implemented yet again as a boost intrusive set. However this changes constantly - when a row has been incorporated into the lower triangle, the columns it contains now have one less 1 in the region of interest. Increasing the gap also affects it. Fortunately, the row structure makes it easy to update just the columns that are directly affected, which is O(b), and then to correct their position in the list. Hence the total operation each time is O(b log(n)) which is much better than than O(n) as previously.

For a column weight of 3, this made phase 1 practically disappear as a performance concern, as I expected. But for a column weight of 5, it was still taking the majority of the time - which I didn't expect. Further analysis showed that keeping the columns in order was very expensive. Every time a row was moved to the gap, every column had to be re-sorted. On further thought, there is only one time when it helps for a column to be sorted, which is when it is being processed as the diagonal element. So just sorting it there, once per column, would work just as well and remove an O(n) element from the algorithm With this change, phase 1 moved down into the noise - for a column weight 3 code, it is about 4% of the total time.

At this point there are no further fundamental improvements to be made - the order of work to be done for each phase cannot be reduced. Further improvement can only come by coding optimizations, which will be discussed in Part 4.

No comments: