Thursday, 1 December 2011

IRP rediscovered - first steps in Template Metaprogramming

One of the nice things about the PDP-11 assembler was its powerful macro features. Not only could you do basic text substitution, you could create loops using the REPT directive, for a fixed number of iterations, or IRP, which iterated over a list of arguments. They were especially good for setting up data structures, forming what nowadays would be viewed as a rather crude application-specific language (ASL). (Before I start getting hate-mail: yes, I know this was originally from the PDP-10.)

For whatever reason, the designers of C eschewed all this and just went for simple text substitution. Every now and then I have a bout of nostalgia for the PDP-11 assembler, especially when trying to build elaborate descriptive data structures. Of course there's always M4, but the learning curve is huge. Actually, I'm a long way down the forgetting curve for M4: a long while back I built a very elaborate set of macros for tracking register usage, among other things, in some MIPS assembler that I wrote. But that was a long time ago.

Then just the other day I really needed the old REPT directive. I've been working on a very interesting algorithm design problem: reducing low-density parity-check (LDPC) codes to a form where they can be encoded by practical hardware. The innermost loops of this algorithm are extremely performance critical - by nature this is an O(n^3) problem (i.e. the complexity increases with the cube of the size of the code). For a realistically sized code of say 32K data bits, the innermost part of the algorithm gets executed several billion times. Normally I'm content to let the compiler worry about the details of code optimization - today's compilers (gcc and MSVC) do a wonderful job. But in this case, saving a single instruction could cut seconds off the execution time, so it was worth digging a bit deeper.

Of course the first part of optimization is to use the right algorithms and data structures. I'd already done all that, cutting the execution time by a factor of thousands compared to our initial, simple implementation. Now I was looking to shave off another factor of two by paying attention to the details.

One such detail was to unroll the critical inner loops, replacing them with linear sequences of instructions with no tests or jumps. After some careful crafting of data structures, the loops were extremely tight, less than ten instructions. One of the loops has a large repeat count, so it was easy just to do it in gulps of 16 at a time, as sketched below. At that level the loop overhead is negligible, and when the remaining number is less than 16, the last few can be done one at a time.
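Here's a minimal sketch of that chunking idea. The step() function and the run_all() driver are hypothetical stand-ins for the real ten-instruction loop body, just to show the shape:

#include <cstddef>

// Hypothetical stand-in for the real loop body.
inline void step(double *v, std::size_t i) { v[i] *= 2.0; }

void run_all(double *v, std::size_t n)
{
    std::size_t i = 0;
    // Main loop: 16 copies of the body per pass, so the test-and-branch
    // overhead is paid once per 16 steps instead of once per step.
    for (; i + 16 <= n; i += 16) {
        step(v, i);      step(v, i + 1);  step(v, i + 2);  step(v, i + 3);
        step(v, i + 4);  step(v, i + 5);  step(v, i + 6);  step(v, i + 7);
        step(v, i + 8);  step(v, i + 9);  step(v, i + 10); step(v, i + 11);
        step(v, i + 12); step(v, i + 13); step(v, i + 14); step(v, i + 15);
    }
    // Remainder: the last few (fewer than 16) are done one at a time.
    for (; i < n; ++i)
        step(v, i);
}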

The other loop was trickier though. The number of iterations is small, in the range 6-20, so the whole loop has to be done at once. A quick experiment showed that gcc implements a switch statement using a jump table, so it would be quick to dispatch to the right unrolled loop. But how to generate the code without tediously repeating the same statements over and over?

That was when I thought of using metaprogramming, i.e. programs that run at compile time rather than at run time. The idea is to declare a class template, parameterized by an integer that tells it how many repetitions you want. The resulting code looks like this:

#include <vector>
using std::vector;

// "operation" and "operand" are the problem-specific types described below.
template<int I> struct repeat
{
    void apply(vector<operation> &ops, vector<operand> &v)
    {
        ops[I-1].execute(v);           // apply operation I-1 to the operands
        repeat<I-1>().apply(ops, v);   // then recurse with one less
    }
};

// Explicit specialization for zero: empty, so the recursion terminates.
template<> void repeat<0>::apply(vector<operation> &ops, vector<operand> &v) { }


The details of what's being done aren't too important here. "ops" is a vector of operations; each element's "execute" function says what to do and which operand vector element to apply it to. We want to make sure that each operation in the vector is applied.

The "apply" function first does the operation corresponding to its parameter, then recursively invokes the class with a parameter of one less. But how to get the recursion to stop? This is where the specialized function declaration comes in. The compiler will always choose an explicit specialization over the generic definition, so when the parameter reaches zero, this empty function is selected and the recursion stops.

The code that uses the class looks like this:

switch (ops.size()) {
case 6:
    repeat<6>().apply(ops, v);
    break;
// ... cases 7 through 19, same pattern ...
case 20:
    repeat<20>().apply(ops, v);
    break;
default:
    for (auto opi = ops.begin(); opi != ops.end(); ++opi) {
        opi->execute(v);
    }
    break;
}


I happen to know that the vector size will normally be in the range 6-20. The default is there so the code will still work, albeit less efficiently, if it isn't. If you really had no idea of the limits, you would first deal with chunks of, say, 16 at a time, then finish off the remainder using the above technique - something like the sketch below.
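For that open-ended case, the template needs a base offset so it can work on a chunk in the middle of the vector. A hedged sketch, reusing the article's "operation" and "operand" types, with "repeat_from" and "apply_all" being names invented for the example:

#include <cstddef>
#include <vector>
using std::vector;

// Like "repeat" above, but applies ops[base] .. ops[base+I-1].
template<int I> struct repeat_from
{
    void apply(vector<operation> &ops, vector<operand> &v, std::size_t base)
    {
        ops[base + I - 1].execute(v);
        repeat_from<I-1>().apply(ops, v, base);
    }
};

template<> void repeat_from<0>::apply(vector<operation> &, vector<operand> &, std::size_t) { }

// Peel off gulps of 16, then finish the remainder - one at a time here,
// or through a switch as above if the last few are still hot.
void apply_all(vector<operation> &ops, vector<operand> &v)
{
    std::size_t i = 0;
    for (; i + 16 <= ops.size(); i += 16)
        repeat_from<16>().apply(ops, v, i);
    for (; i < ops.size(); ++i)
        ops[i].execute(v);
}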

It looks as though this will produce horrific code, what with the recursion and everything else. If you compile without optimization, for debugging, indeed it does, with a deep nest of function calls, each with its own call, entry and exit sequences. But if you turn on full optimization, gcc produces exactly what you would write if you hand-coded it - just the exact set of instructions required to implement each iteration of the loop. (I imagine MSVC would too, though I haven't tried it.) You'll notice that the "repeat" object is instantiated, but since the class is empty - no data members - constructing it generates no code at all.
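It's easy to check this for yourself by looking at the generated assembly; the file name here is just a placeholder:

g++ -std=c++0x -O2 -S ldpc_encode.cpp

With -O2, the resulting .s file should show each switch case as a straight-line run of instructions, with no trace of the recursive calls.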

To the real experts in metaprogramming (all dozen of them), this is child's play. But for the casual visitor to the topic, like myself, it's a neat technique that can save a lot of tedious and error-prone repetition. As I expected, unrolling this innermost of inner loops saved about 5% of the execution time, which is a useful contribution to my overall target of 50%.
