Thursday, 13 August 2020

The Doing Nothing Contract, or How Not to Run Large Projects

Soon after I left DEC, in 1995, I got involved in what would have been the biggest Systems Integration (SI) project they had ever done. This is the story of the project.

I joined DEC when I left university, 20 years earlier. It was a fantastic place for an engineer to work, and I enjoyed nearly every day I worked there. But by 1995 it was obviously going downhill - it was acquired by Compaq in 1998, and Compaq in turn by HP in 2002 - and I found a way to make a decent exit. While I was looking for another job, I started a consultancy business which turned out to keep me busy for the next four years.

About a year later I ran into a former colleague on a plane. He told me about a project that they were working on for a major European telco. It was going to be huge. Did I know anyone who might be able to help? Well, yes, there was me. I think he knew that and was just being polite. I did point out that my daily rate was over double what DEC would normally pay. This wasn't a problem, he said, because they wanted to assemble an elite team of top-level architects to get the overall design right. Within a week I had a purchase order for three months of my time, 40 hours per week, at my usual high daily rate.

The following Monday I showed up at the DEC office in Reading, England. There were two other people in the "elite" team. One was Dave, who I knew quite well - like me, he had already spent 20 years as an employee.

The project was very interesting, on an oft-repeated theme. Telephone networks have always been built from proprietary systems made by specialized suppliers like (then) GEC and Alcatel, costing at least ten times more than normal contemporary computers. The client had figured out that this was just a big distributed computing application, and wanted to run their national telephone and data network using off-the-shelf computer hardware and, as far as possible, software.

It seems like a wonderful idea, but it has been tried several times and so far has never really worked (which I guess is a spoiler for this article). The problem is that, even now and certainly 25 years ago, these telcos were used to being almost their suppliers' only customer. They could make incoherent or outrageous demands, confident that their suppliers would have to follow. The price tag reflected this, but the people making the demands - the engineers - weren't the people paying the bills, so those dots never got joined up inside these huge bureaucracies. A typical interaction would go:

Supplier: the project will use database X and transaction software Y
Telco:    that's no good, we need features P, Q and R that X and Y don't have
Supplier: we could add those, but it's bespoke engineering and will add (lots) to the cost
Telco:    no, we want off-the-shelf software, we don't want to pay for custom development
Supplier: X and Y are what's on the shelf; if you want something else, you have to pay for it
Telco:    (utter incomprehension)

In its latest iteration, this has led to the ETSI NFV (Network Function Virtualization) project, which, in its eight years of existence so far, has yet to deliver an actual functioning network.

Anyway... our mission was to use off the shelf DEC computers and software to build the switching control system at the heart of the network. It isn't really a hard problem. The basis of it is extremely simple: take a number that someone has dialled (this was a while ago) and translate it into a series of simple instructions to the physical switches, like "connect channel 92 of trunk 147 to channel 128 of trunk 256".

The only thing that makes it hard is the scale - this has to work for millions of users and concurrent calls. But even then, none of these actions has to be closely synchronised. It isn't like, say, Facebook, where something you upload needs to become instantly visible to a billion users around the world.

Within a week, Dave and I had figured out how to put the available software components together and have a working, scalable prototype within a couple of months. Turning that into a production system would be a much bigger job, needing integration with the telco's dozens of management systems, but that was all low-risk, routine stuff. We started writing code.

What we had completely failed to take into account, working in our cosy little office, was the DEC bureaucracy. In a much larger open-plan office nearby was an already-large team of project managers, program managers, project documentation specialists, and for all we knew telephone sanitizers as well. They operated in blissful ignorance of any actual technical details, as they came up with the cost estimates that would be at the heart of the formal bid for the project.

DEC was always thought of as a computer manufacturer, but they had a large and thriving SI business as well. Over the years they had built up a series of procedures and processes for managing these projects. Most of the projects were pretty small - integrate a driver for a new piece of hardware into an operating system, or build a user interface around a database application. But some were big - tens of person-years - and a few were really big, like our telco project.

So part of the process was to know when the project was too big for the current level of project management. When that happened, the project would get escalated to the next tier of project management. They would bring in their own team to look at the design, the business aspects, the risk, and everything else.

The first thing the new team would do is to multiply all the existing work estimates by two or more, just as a matter of principle because that is nearly always right. (A very successful SI company CEO who I knew years ago always multiplied all engineering estimates by pi. He claimed it worked every time). Then they would add a whole new layer of program managers, project documentation specialists, telephone sanitizers and all the rest. Then they would start looking at the details, invariably resulting in another factor of two or so.

Our project had already been through two such escalations. Realistically this was probably a 50 person-year project, but the estimates were already in the hundreds, maybe 20 times the original figure. As a result it triggered yet another escalation, to the ultimate level, the corporate Large Projects Office (LPO) in Geneva. 

The LPO was to SI what J K Rowling's Dementors were to Hogwarts. Their job was to suck all joy, and possibility of success, out of a project. I have no idea whether they actually delivered any Large Projects, but I doubt it. Within a week they had doubled all the existing estimates and added yet another layer of program management and the rest. The project had now reached a size - approaching 1000 person-years, 10 times any realistic estimate - that just flat-out terrified the country management. A project on this scale, if it went wrong - which was just about guaranteed - could take the whole company down, and certainly result in some very senior people needing to seek new career opportunities.

The whole team was called together in a large meeting room. DEC had decided to no-bid the project. Permanent employees would be reassigned as soon as possible, while all contractors were terminated immediately.

This is where things got surreal for Dave and myself. We went to some project management type and pointed out that there was no provision for termination in the purchase orders we had received. DEC had bought 90 days of my time, and 180 days of Dave's, just as if they had bought a thousand cases of beer.

"That," said the project management type, "is covered by our standard terms and conditions. It is implicit in the purchase order."

"Maybe," we replied, "but what isn't in the contract, isn't in the contract. The only contract we have is the PO. And the PO has no mention of any standard terms and conditions."

They quickly accepted that we had a point. "OK, but in that case you will have to agree to work on any other project for the duration of the contract."

That was fine by us. A week or so later, such a project cropped up. We started reading documents and figuring out what was needed. But a day later we were called into our original project manager's office.

"The new project is refusing to pay your rates. They are much higher than the normal DEC contract rate, and they won't pay. They say we have to pay because we agreed to the abnormal rate. But we're not willing to subsidize other projects. So you must stop work on the project immediately. And you are not to work on anything else either. You are forbidden to work on any project except the one you were originally hired for." And that had been cancelled, there was absolutely nothing to be done for it. So we were forbidden to do any work at all.

It was Dave who came up with the name "The Doing Nothing Contract". I was commuting from Nice at the time, flying to England on Monday and back on Friday. DEC insisted on me being at the office to do nothing - I was not permitted to do nothing from home. That was the beginning of the weirdest few weeks of my professional life. I'd found a hotel just outside town, an idiosyncratic converted farmhouse with low rates and enormous rooms, furnished entirely with dubious-quality estate-sale pieces.

Dave (who did live fairly locally) and I would roll up to the office around 10, read the news and chat for a while, then around 11.30 go off to a pub for lunch. We'd be back by 2.30, spend another hour or two nattering (no web to surf back then), and go home. I still had plenty of friends from when I lived in Reading, so I never spent an evening on my own. I put on about five pounds during the brief period this lasted.

After a couple of weeks, I got another call from the project manager.

"Look, this is silly." I agreed. "How about if we pay off half the remaining contract, and call it quits?"

That was fine by me, I already had other work lined up and this was just free money. I left that afternoon and didn't return. Dave made the same suggestion, but his contract had a lot longer to run, so they said no.

We stayed in touch. A few days later, they called him in and told him they would simply terminate the contract "in breach", which is to say they would just stop paying him. Dave had been at DEC for years and knew exactly how things worked on the inside.

"But if you do that, I'll sue."

"Sure, yes, off the record, that's what we'd advise."

"And DEC never contests things like that, so you'll just settle, and end up paying the full amount, plus costs."

"That's true. But that will come out of a different budget, not ours."

Even they could see the silliness of all this, though. Soon afterwards they settled with him on the same basis as me. He walked away with tens of thousands in unexpected cash, and went back to his day job doing IT projects for insurance companies. And that was the end of the Doing Nothing Contract.

Monday, 13 May 2013

On typename - and why C++ is a parser's nightmare

If you've done any significant C++ programming using templates, you'll certainly have run into the annoying rule requiring you to write "typename" before constructs of the form "class::member" if the class is actually a template parameter. For example:

template <class C>
class foo
{
     typedef typename C::value_type value_type;   // C::value_type is a dependent name
};

If you miss out "typename" the compiler will complain. What's more, if it's GCC it will complain in a completely mysterious fashion, giving you no clue as to what the actual problem is.

And yet, surely it's obvious that this must be a typename? Why require all those extra keystrokes and visual clutter for something which is obvious? Every now and then I'd wonder about this, and read something which described the obscure situations where it isn't obvious. But I'd promptly forget, until the next time I spent ages pondering over unhelpful error messages until the little light came on - "aha, it wants a 'typename'!".

There's a good explanation of why it actually isn't obvious, to the compiler at least, here. But it took trying to explain C++ parsing to someone for me to really get it.
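
To see why it genuinely is ambiguous, here is a minimal sketch (the class names HasType and HasMember are made up for illustration). Inside the template, C::value_type might name a type in one instantiation and a static data member in another, and the two readings parse completely differently:

struct HasType   { typedef int value_type; };            // value_type is a type
struct HasMember { static const int value_type = 42; };  // value_type is a value

template <class C>
void fn(int x)
{
    // Without 'typename', the parser must assume C::value_type is a value,
    // so "C::value_type * x" would be parsed as a multiplication. The
    // keyword forces the declaration reading:
    typename C::value_type * p = &x;    // with C = HasType, p is an int*
    (void)p;                            // silence unused-variable warnings
}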

C++ teeters on the hairy edge of total ambiguity the whole time, without you even realising it as a user. And one of the worst culprits is the apparently innocent reuse of the less-than and greater-than symbols as template brackets. Consider the perfectly clear snippet:

foo<bah> foobah;

It's blindingly obvious that this is declaring 'foobah' to be an object of class 'foo' instantiated with 'bah' (presumably another class) as its template parameter.

Well, except that if all three names are actually ints, those template brackets suddenly turn into relational operators. It's not a very useful code snippet, but it is syntactically correct: first compare foo with bah, giving a bool result; then compare that result (promoted to int) with foobah; then throw the (pretty meaningless) final result away.
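
Here are the two readings side by side, with declarations contrived to force each one (a sketch; the function wrapper just gives the second reading its own scope):

// Reading 1: foo is a template, so this is a declaration.
template <class T> class foo {};
class bah {};
foo<bah> foobah;                 // foobah is an object of type foo<bah>

// Reading 2: all three names are ints, so it's an expression statement.
void fn()
{
    int foo = 1, bah = 2, foobah = 3;
    foo < bah > foobah;          // parsed as (foo < bah) > foobah: the bool
                                 // result of the first comparison is promoted
                                 // to int, compared with foobah, and the
                                 // final result thrown away
}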

You don't even need templates. The reuse of '*' for both multiplication and dereferencing can also lead to ambiguity. Combining the two can get exciting:

foo<bah> *fbptr;

Obviously a declaration of a pointer to a 'foo<bah>'. Unless of course foo and bah are both numeric and fbptr is a pointer to a numeric. Then it's a replay of the previous example: the statement parses as (foo < bah) > (*fbptr), with the '*' now a unary dereference rather than part of a declarator.

This is all made worse because C++ (necessarily) allows you to refer to class members before they're defined. Consider the following example:

class c1
{
    template<class C> class d1 
    {
        ....
    };
    class e1
    {
        ....
    };
};

class c2 : public c1
{
    void fn()
    {
        d1<e1> f1;    // a declaration of f1? or (d1 < e1) > f1?
    }
    ....
    int d1, e1, f1;   // these members hide the names inherited from c1
};

When the parser sees the apparent declaration of f1, everything it knows tells it that this is indeed a declaration. Only later does it come across other declarations that completely change the interpretation. I wonder how compilers deal with this - it would seem necessary to hold two completely different parse trees and hope that one of them will make sense later. Just to make it more interesting, the class could also go on to overload the '<' and '>' operators, so that they apply to quite different classes.

Even lowly Fortran IV wasn't immune to this kind of problem. It had no reserved words, and gave no significance to spaces. So when the compiler saw:

DO 100 I=1

which is obviously the beginning of a DO statement (equivalent to 'for' in C++), everything hinges on the next character. If it's a comma, this must indeed be a DO statement. But if it's, say, '+', or there is no next character on the line, then it's an assignment to a variable called 'DO100I' - and of course Fortran also didn't require variables to be declared, so they could just pop up like this.

I'm glad I don't write compilers for a living!

Monday, 12 December 2011

Algorithm Design: Efficient LDPC Encoding (Part 4: Optimization)

In Part 3, I described the implementation of the algorithms for reducing an LDPC code to an encodable form. At that point, all the algorithms were the most efficient possible. The only remaining performance gains would come from improving the code. This is rarely worth spending too much time on, but in this case the overall performance is completely dominated by two inner loops. One iterates through a sparse representation of a row, adding it to a dense row. The other iterates along the elements of the dense representation, adding them. Halving the time spent in both of these loops - just a few instructions each - will halve the execution time of the whole algorithm. So it's worth taking a close look.

Let's start with the add-sparse-to-dense loop. The original code used conventional STL iterators to scan through the elements of the sparse row, then for each element, converted it to the offset-and-mask combination for the particular bit number, and applied it using an xor operation. It's the obvious way. But each sparse row is added to a dense row tens of thousands of times, so it's worth considering whether any part of this operation can be amortized.

The final solution was to pre-calculate a vector containing the offset-and-mask for each entry. The latter was actually represented as a class, called "bitref". In the source code, this results in a vector<bitref>, which is iterated through in the usual way. The compiler is nevertheless clever enough to inline all this and reduce the inner loop to just four machine instructions: two to extract the offset and mask, one to perform the xor operation, and one to advance to the next entry. Not bad. Performance was improved substantially, reducing the time for phase 2 of the algorithm by a factor of about three.
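
The post doesn't show the code, but the shape of it is roughly this - a sketch, where the name bitref comes from the text, while the field names, the helper functions and the 32-bit word size are my assumptions (the word size chosen to match the "64K bits, or 2K operations" arithmetic below):

#include <cstddef>
#include <cstdint>
#include <vector>

// Precomputed location of one bit within a dense row of 32-bit words.
class bitref
{
public:
    explicit bitref(size_t bit) : offset(bit / 32), mask(1u << (bit % 32)) {}
    size_t   offset;   // index of the word holding the bit
    uint32_t mask;     // single-bit mask within that word
};

// Built once per sparse row, then reused tens of thousands of times.
std::vector<bitref> precompute(const std::vector<size_t>& bit_numbers)
{
    std::vector<bitref> row;
    row.reserve(bit_numbers.size());
    for (size_t i = 0; i < bit_numbers.size(); ++i)
        row.push_back(bitref(bit_numbers[i]));
    return row;
}

// The inner loop: add (xor) a sparse row into a dense row.
void add_sparse_to_dense(const std::vector<bitref>& row, uint32_t* dense)
{
    for (std::vector<bitref>::const_iterator it = row.begin(); it != row.end(); ++it)
        dense[it->offset] ^= it->mask;
}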

There remains the overhead of the loop. Given the tiny size of the content, the two instructions of loop overhead, and their impact on pipelining, are worth worrying about. In assembler, the obvious way to unroll the loop would be to lay out sequential instructions corresponding to its maximum size, then jump into these at the appropriate point corresponding to the loop size (i.e. the number of entries in the vector). This is difficult to do in C++ though.

In the end I came up with something similar, but with a distinct unrolled loop for each possible size. This was done as my first tiny adventure in C++ template metaprogramming, and is described here. The compiler is smart enough to translate the switch statement based on vector size into an indexed jump, which then executes the entire loop in a straight line. That gave me about another 5% improvement.
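
The essence of the trick is something like the following sketch, reusing the bitref type and inner loop sketched above; the row weights in the switch are placeholders for whichever weights actually occur:

// Unroll<N>::run expands, at compile time, into N straight-line xors.
template <int N>
struct Unroll
{
    static void run(const bitref* b, uint32_t* dense)
    {
        dense[b->offset] ^= b->mask;
        Unroll<N - 1>::run(b + 1, dense);
    }
};

template <>
struct Unroll<0>
{
    static void run(const bitref*, uint32_t*) {}
};

void add_sparse_unrolled(const std::vector<bitref>& row, uint32_t* dense)
{
    const bitref* b = &row[0];
    switch (row.size())     // compiled to an indexed jump on the row weight
    {
    case 3:  Unroll<3>::run(b, dense);  break;
    case 4:  Unroll<4>::run(b, dense);  break;
    case 5:  Unroll<5>::run(b, dense);  break;
    default: add_sparse_to_dense(row, dense);   // plain loop as fallback
    }
}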

Having got the inner loop as tight as possible, it was time to think about the next layer of the loop. GCC does a good job of inlining functions when it's the right thing to do, but examination of the assembler output (-S option) showed that it was not inlining a couple of critical functions here. I played around with the compiler parameters that control inlining for a while and things got a little better, but I just could not convince it to inline one critical function. Of course the "nuclear option" of making it a macro always exists, but I really wanted to avoid that. I tried the "flatten" function attribute on the outer loop, which tells the compiler to inline absolutely everything, but after the compiler had run for half an hour or so I stopped it. I think it got put off by all the calls to boost::format that I use in my debug log macros.
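
For reference, the attribute in question is spelled like this in GCC (the function name here is made up):

// Ask GCC to inline every call inside this function, recursively.
__attribute__((flatten))
void add_all_rows()
{
    // ... the phase 2 outer loop ...
}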

Eventually, I found a minor rearrangement of functions that got everything inlined. That gave me another 5% or so performance improvement.

That dealt with the inner loop of phase 2, adding sparse rows to dense rows. In phase 3, the inner loop is adding dense rows to dense rows. Unrolling this loop was easier since it is always over the whole length of a dense row - over 64K bits, or 2K operations. There's nothing to be gained by completely unrolling such a loop. Instead I changed the code to do it in "gulps" of 16 entries at a time, then used a normal loop to deal with the remainder at the end. I also rearranged things here so that the call to the inner loop was fully inlined.
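
Again as a sketch, with the same assumed 32-bit words (so a 64K-bit row is 2K words) and made-up names:

// Add (xor) one dense row into another, in gulps of 16 words.
void add_dense_to_dense(uint32_t* dst, const uint32_t* src, size_t nwords)
{
    size_t i = 0;
    for (; i + 16 <= nwords; i += 16)       // main loop, one gulp per pass
        for (size_t j = 0; j < 16; ++j)     // fixed trip count: the compiler
            dst[i + j] ^= src[i + j];       // unrolls this completely
    for (; i < nwords; ++i)                 // remainder, at most 15 words
        dst[i] ^= src[i];
}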

And that is about as far as things can be taken. The original C++ code took about 400 seconds for a column-weight 3, 32K data bit code. The final code takes under 7 seconds. I never ran a column-weight 5 code to completion with the original code - it would certainly have taken thousands of seconds, maybe much more. But now, it runs in about 45 seconds.

Of course there's a price to pay for all this. One of the first principles of writing maintainable systems is never to keep the same information in more than one form. This code violates that all over the place - for example, the sparse and dense representations of rows. But without this kind of approach, the code would be unusable anyway, so its maintainability wouldn't matter much. It has certainly been one of the most interesting bits of programming I've undertaken in a long time.