Sunday 12 May 2013

On typename - and why C++ is a parser's nightmare

If you've done any significant C++ programming using templates, you'll certainly have run into the annoying rule requiring you to write "typename" before constructs of the form "class::member" if the class is actually a template parameter. For example:

template <class C>
class foo
{
     typedef typename C::value_type value_type;
};

If you miss out "typename" the compiler will complain. What's more, if it's GCC it will complain in a completely mysterious fashion, giving you no clue as to what the actual problem is.

And yet, surely it's obvious that this must be a typename? Why require all those extra keystrokes and visual clutter for something which is obvious? Every now and then I'd wonder about this, and read something which described the obscure situations where it isn't obvious. But I'd promptly forget, until the next time I spent ages pondering over unhelpful error messages until the little light came on - "aha, it wants a 'typename'!".

There's a good explanation of why actually it isn't obvious, to the compiler at least, here. But it took me trying to explain C++ parsing to someone for me to really get it.
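To make the ambiguity concrete, here is a minimal sketch of my own (not taken from the linked explanation). The names has_type, has_value, product, declare and the global p are all invented for illustration. The same tokens "C::name * p" are a multiplication when C::name is a value and a pointer declaration when it is a type, and the compiler cannot know which until the template is instantiated:

```cpp
// Hypothetical classes: 'name' is a type in one and a value in the other.
struct has_type  { typedef int name; };
struct has_value { static const int name = 6; };

int p = 7;  // hypothetical global, so 'C::name * p' can be an expression

template <class C>
int product()
{
    // No 'typename': the compiler assumes C::name is a value,
    // so this multiplies it by the global p.
    return C::name * p;
}

template <class C>
int declare()
{
    // With 'typename': C::name is a type, so this declares a local
    // pointer called p (hiding the global one).
    typename C::name * p = 0;
    return p == 0 ? 1 : 0;
}
```

Here product<has_value>() multiplies 6 by 7, while declare<has_type>() never touches the global p at all.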

C++ teeters on the hairy edge of total ambiguity the whole time, without you even realising it as a user. And one of the worst culprits is the apparently innocent reuse of the less-than and greater-than symbols as template brackets. Consider the perfectly clear snippet:

foo<bah> foobah;

It's blindingly obvious that this is declaring 'foobah' to be an object of class 'foo' instantiated with 'bah' (presumably another class) as its template parameter.

Well, except that if all three names are actually ints, those template brackets suddenly turn into relational operators. It's not a very useful code snippet, but it is syntactically correct. First compare foo with bah, creating a bool result. Then compare that with foobah, having first promoted the bool to int. Then throw the (pretty meaningless) result away.
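As a small sketch of that reading, the function below (chained is a name I've made up) returns the value of the expression instead of throwing it away. Most compilers will probably warn about the chained comparison, but it should compile:

```cpp
// Hypothetical helper: the same tokens as 'foo<bah> foobah', read as an
// expression because all three names are ints.
constexpr bool chained(int foo, int bah, int foobah)
{
    return foo<bah> foobah;   // parsed as (foo < bah) > foobah
}

// (1 < 2) is true, which is promoted to 1, and 1 > 0 holds.
static_assert(chained(1, 2, 0), "true > 0");
// (2 < 1) is false, which is promoted to 0, and 0 > -1 holds.
static_assert(chained(2, 1, -1), "false > -1");
// (1 < 2) is true, promoted to 1, and 1 > 5 does not hold.
static_assert(!chained(1, 2, 5), "true is not > 5");
```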

You don't even need templates. The reuse of '*' for both multiplication and dereferencing can also lead to ambiguity. Combining the two can get exciting:

foo<bah> *fbptr;

Obviously a declaration of a pointer to a 'foo<bah>'. Unless of course foo and bah are both numeric and fbptr is a pointer to a numeric. Then it's a replay of the previous example, with the dereferenced fbptr as the right-hand operand.
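The pointer version can be sketched the same way (again, chained_with_pointer is just a name invented for the illustration):

```cpp
// Hypothetical helper: 'foo<bah> *fbptr' with two ints and a pointer to int.
bool chained_with_pointer(int foo, int bah, const int* fbptr)
{
    return foo<bah> *fbptr;   // parsed as (foo < bah) > (*fbptr)
}
```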

This is all made worse because C++ (necessarily) allows you to refer to class members before they're defined. Consider the following example:

class c1
{
public:
    template<class C> class d1
    {
        // ....
    };
    class e1
    {
        // ....
    };
};

class c2 : public c1
{
    void fn()
    {
        d1<e1> f1;
    }
    // ....
    int d1, e1, f1;
};

When the parser sees the apparent declaration of f1, everything it knows tells it that this is indeed a declaration. Only later does it come across other declarations that completely change the interpretation. I wonder how compilers deal with this - it would seem necessary to hold two completely different parse trees and hope that one of them will make sense later. Just to make it more interesting, the class could also go on to overload the '<' and '>' operators, so they apply to different classes.
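Here is a cut-down version of that example which should compile, as a sketch. I've used struct so everything is public, given the three ints arbitrary initial values, and made fn return the value of the expression, none of which is in the original. The point is that inside a member function body, names are looked up as if the whole class had already been seen, so the later int members win:

```cpp
struct c1
{
    template <class C> struct d1 {};
    struct e1 {};
};

struct c2 : c1
{
    bool fn() const
    {
        // d1, e1 and f1 resolve to the int members declared below,
        // so this is (d1 < e1) > f1, not a declaration.
        return d1<e1> f1;
    }

    int d1 = 1, e1 = 2, f1 = 0;   // arbitrary values for illustration
};
```

With those values, c2().fn() compares (1 < 2) with 0 and returns true. Delete the int members and return the expression's tokens to a plain statement, and the very same "d1<e1> f1;" is a declaration again.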

Even lowly Fortran IV wasn't immune to this kind of problem. It had no reserved words, and gave no significance to spaces. So when the compiler saw:

DO 100 I=1

which is obviously the beginning of a DO statement (equivalent to 'for' in C++), everything hinges on the next character. If it's a comma, this must indeed be a DO statement. But if it's, say, '+', or there is no next character on this line, then it's an assignment to a variable called 'DO100I' - and of course Fortran also didn't require variables to be declared, so they could just pop up like this.
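The decision can be sketched in a few lines of C++. This is only an illustration of the "look for the comma" rule, not how any real Fortran compiler worked: the function name is made up, and it ignores labels, continuation lines, Hollerith constants and everything else a real scanner would have to handle.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess whether a Fortran IV line is a DO statement.
bool looks_like_do_statement(const std::string& line)
{
    // Blanks are insignificant, so squeeze them out first.
    std::string s;
    for (char c : line)
        if (c != ' ')
            s += c;

    if (s.compare(0, 2, "DO") != 0)
        return false;

    const std::size_t eq = s.find('=');
    if (eq == std::string::npos)
        return false;

    // A comma after the '=' that is not nested inside parentheses
    // means this really is a DO statement.
    int depth = 0;
    for (std::size_t i = eq + 1; i < s.size(); ++i)
    {
        if (s[i] == '(')
            ++depth;
        else if (s[i] == ')')
            --depth;
        else if (s[i] == ',' && depth == 0)
            return true;
    }
    return false;   // otherwise: an assignment to something like DO100I
}
```

So "DO 100 I=1,10" is classified as a DO statement, while "DO 100 I=1" and "DO 100 I=1+2" are treated as assignments.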

I'm glad I don't write compilers for a living!
