Tuesday 6 December 2011

Algorithm Design: Efficient LDPC Encoding (Part 2: Algorithms)

In Part 1, I described the problem we are trying to solve, taking a sparse matrix and solving the corresponding system of simultaneous equations (around 33000 of them) so that we can build an efficient hardware encoder for Low Density Parity Check (LDPC) codes.

Efficient encoding requires that the original sparse matrix be transformed such that all the encoder has to do is calculate a number of parity checks. Most of these are very sparse, so they can use shared hardware. A small proportion (about 3% in a typical code) are dense, i.e. they have about the same number as 1s and 0s, and so cannot share hardware.

The resulting transformed matrix is called the "reduced" matrix, and when it is complete it has the following form:

+------------------------+--------+------------------------+
|            D           |    E   |           F            |
+------------------------+--------+------------------------+
|            A           |    B   |           C            |
+------------------------+--------+------------------------+

Rows in the D/E/F part are called the "gap" in the literature. Initially the reduced matrix is set to be identical to the base matrix, and the gap is empty. In a matrix representing a system of simultaneous equations, such as this, rows can be swapped without changing the meaning, as can columns. Also, rows can be added together (although columns cannot be). In binary addition, 1+1=0. We use these facts to rearrange the matrix into reduced form, by the following steps.

1. Transform part C into "lower triangular" form (LTF), in which everything above the main diagonal is zero. This can be done by swapping rows and columns. At each step, we look for a column that has just a single entry above the current diagonal row, then swap it with the current diagonal column. Finding a suitable column is the key to performance at this step.

2. Sometimes, we can't find such a column. This is how the gap gets created. We choose a column with the smallest number of such entries and swap that. Then we exchange rows so that the populated rows move into the gap area.

3. When this part is finished, C is in lower triangular form, but the gap is not. The next task is to complete the task for the gap, by emptying F altogether and getting E into lower triangular form. So far, all rows are still sparse, since no row or column has been changed apart from ordering.

4. For each row in F, we eliminate all bits using Gaussian elimination. Starting with the rightmost one bit, we add the corresponding row from C which has this as its rightmost bit (i.e. on the diagonal). We repeat this, moving leftward, until the F part of each row has been emptied. In the process, the rest of the row becomes dense, with on average as many 1s as 0s.

5. We now have F empty, and we need to transform E into lower triangular form. We do this by Gaussian elimination again, this time using rows from the gap. We start from the bottom and work up, creating the diagonal as we go, so that we don't put back bits that we have already eliminated.

6. Now we're done. E and C between them have a neat diagonal line with nothing above it. F is empty. A and B are still sparse, but D and the lower triangle of E are dense. All the bits in columns in A and D are data bits. The check bits, in the remainder of the matrix are generated from these.

Let's take a look at the performance of each of these steps. First we need to define some terms:

b: the number of 1 bits in a single row in the base matrix. This is small, and independent of the size of the code. It's also referred to as the row weight. We refer to the number of 1s in a single column as the column weight.

g: the number of rows in the gap region (D/E/F). Although this is much smaller than the number of data bits, it is directly linked to it. For a column weight of 3, it is about 3.3% of it. Hence anything which is O(g) is also O(n), though with a much smaller actual value.

n: the total number of rows.

The task falls into three phases:

-- Phase 1: rearrangement of rows and columns to create the C region. This has to be done once for each row (less the gap rows), and each time, we have to select the best available row. If we simply scan the rows looking for the best one, this will be O(n), making the overall task O(n2). We'll explain later how this can be made O(n log(n)). In addition we have to create the gap. Rippling a row up into the gap is O(n), and has to be done for each gap row, so the total task is O(n*g). In principle this is O(n2), but because g is so much smaller than n, with suitable design it can be kept small, comparable with the O(n log(n)) time of the row rearrangement.

-- Phase 2: eliminating all the 1s in the F region. There are g rows to deal with, and once the process starts they quickly become dense. Hence O(n*g) row additions are required, where one (the gap row) is dense, and the other, coming from the C region, is sparse. The amount of work per addition is O(b), making the whole task O(n*g*b).

-- Phase 3: eliminating the upper half of the triangle in the E region. There are g rows, and O(g) bits to be eliminated in each row, so there are O(g2) additions. Since these involve adding dense rows to each other, the amount of work per addition is O(n), making the whole task O(n*g2) - or in other words, O(n3). For small codes, this phase is dominated by phase 2, but as the code size increases it starts to dominate the total time. This is especially true if larger row or column weights are used, since the gap becomes proportionately larger (about 10% of the total rows for a column weight of 5).

These are the fundamental limits of the algorithm - no matter how clever the design, the three phases will have complexity O(n log(n)), O(n*g*b) and O(n*g2) respectively. The trick of a good implementation is to achieve these limits, and to minimize the actual values in each case. Part 3 discusses how this was done.

No comments: