Tuesday, 10 November 2020

My Little List of Useful Principles

This first appeared on my web page, but I thought it deserved to be repeated here on my blog. 

The David Stone Principle

“Never ask a question that has an answer you may not like.” Also expressed as “It is easier to obtain forgiveness than permission”. In other words, don’t ask if it is OK to do something, because the chances are there will be someone who will have some reason why you shouldn’t, and having asked the question (and got an answer you don’t like), you have placed yourself under an obligation to do something about the answer. Whereas if you just got on and did it, you could deal with any objections afterwards. This has two advantages. First, you’ve already done what you intended, and it is pretty unlikely that you will be made to undo it. Second, people are less likely to object after the fact anyway.

Harper’s Theory of Socks

Everybody who has ever packed a suitcase knows that no matter how full the suitcase, no matter how difficult it is to close, there is always some crevice where you can squeeze in one more pair of socks. Those familiar with the Principle of Mathematical Induction will immediately see that it follows that you can put an infinite number of pairs of socks in a single suitcase.

If this is obviously fallacious, it is less obvious why. But in any case it is a useful riposte to the executive or marketing person who wants to add just this one tiny extra piece of work to a project.

Law of Ambushes

I heard this one from Tony Lauck, but he claims to have got it from someone else. Think of an old-fashioned Western, with the good guys riding up towards the pass. They know the bad guys are up there somewhere, and they’re looking every step of the way, scanning the hilltops, watching for any movement, peering around twists and turns in the trail. Suddenly there’s a dramatic chord and the bad guys appear from nowhere, guns blazing. Of course the good guys triumph, except the one you already figured was only there to get shot, but the point is, ambushes happen and take you by surprise even though you expect them, even though you’re waiting for them every second. And they always come from where you weren’t expecting and weren’t watching.

The Lauck Principle of Protocol Design

This one is a little technical, but it is so fundamentally important to the small number of people who can benefit from it, that I include it anyway. Communication protocols (such as TCP) work by exchanging information that allows the two, or more, involved parties to influence each other's operation. When designing a protocol, you have to decide what information to put in the messages. It is tempting to design messages of the form "Please do such and such" or "I just did so and so". The problem here is that the interpretation of such messages generally ends up depending on the receiver having an internal model of its partner's state. And it is very, very easy for this internal model to end up being subtly wrong or mis-synchronised (see the Law of Ambushes). The only way to build even moderately complex protocols that work is for the messages to contain only information about the internal state of the protocol machine. For example, not "please send me another message", but "I have received all messages up to and including number 11, and I have space for one more message". There are legitimate exceptions to this rule, for example where one protocol machine has to be kept very simple and the other is necessarily very complex, but they are rare and exceptional. As soon as both machines are even moderately complex, this principle must be followed slavishly.
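
To make the distinction concrete, here is a minimal sketch in Python - my own illustration, not any real protocol, and the message names and fields are invented - contrasting a command-style message with a state-reporting one:

from dataclasses import dataclass

# Command style: "please send me another message". To interpret this, the
# receiver has to maintain a model of what its partner thinks has happened.
@dataclass
class CommandMessage:
    request: str                 # e.g. "send_next"

# State style (the Lauck principle): the message reports only the sender's
# own state, which the peer can act on without modelling anyone else.
@dataclass
class StateMessage:
    highest_received: int        # "I have received everything up to and including this"
    buffer_space: int            # "I have room for this many more messages"

def build_ack(received_in_order: int, free_slots: int) -> StateMessage:
    # Built entirely from local state - no guesses about the peer's intentions.
    return StateMessage(received_in_order, free_slots)

print(build_ack(11, 1))          # received up to 11, space for one more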

The Lauck Principle of Building Things That Work

If you don’t understand what happens in every last corner case, every last combination of improbable states and improbable events, then it doesn’t work. Period. Yes, you may say, but it is too complex to understand all of these things right now. We will figure them out later as we build it. In this case, you are doomed. Not only does it not work, but it will never work.

The Jac Simensen Principle of Successful Management

Get the right people doing the things they’re good at, and then let them get on with it. It sounds simple, but it is rarely done thoroughly in practice. It's applicable to all levels of management but especially at more senior levels where there’s a lot of diversity in the tasks to be undertaken.

The Principle of Running Successful Meetings

Write the minutes beforehand. If you don’t know what outcome you’re trying to achieve, you stand little chance of getting there.

Harper’s Principle of Multiprocessor Systems

Building multiprocessor systems that scale while correctly synchronising the use of shared resources is very tricky. Whence the principle: with careful design and attention to detail, an n-processor system can be made to perform nearly as well as a single-processor system. (Not nearly n times better, nearly as good in total performance as you were getting from a single processor). You have to be very good – and have the right problem with the right decomposability – to do better than this.
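
A back-of-envelope illustration (mine, not Harper's, with made-up numbers) of why this happens: if every operation has to hold a single shared lock for some fraction of its time, the lock caps total throughput no matter how many processors you add.

# Purely illustrative model: one shared lock held for a fixed fraction of
# each operation's time. The lock can serve at most 1/lock_fraction
# operations per unit time, however many processors there are.
def max_throughput(n_processors: int, lock_fraction: float) -> float:
    return min(n_processors, 1.0 / lock_fraction)

for n in (1, 2, 4, 8, 16):
    print(n, max_throughput(n, lock_fraction=0.5))
# With the lock held half the time, sixteen processors deliver only twice the
# throughput of one - and real systems add cache traffic and contention on top.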

Harper’s Principle of Scaling

As CPU performance increases by a factor of n, user-perceived software performance increases by about the square root of n - so a CPU that is a hundred times faster feels only about ten times faster. (The rest is used up by software bloat, fancier user interface and graphics, etc.)

The Delmasso Exclamation Mark Principle

The higher you go in the structure of an organisation, the more exclamation marks are implicitly attached to everything you say or write. So when a junior person says something, people evaluate the statement on its merits. When the VP says it (even in organisations and cultures that aren’t great respecters of hierarchy and status, like software engineering), everyone takes it much more seriously. It means that as you move up the organisation, you have to be increasingly careful about what you say, and especially you have to be increasingly moderate (which doesn’t always come naturally!).

The Dog-House Principle

A dog-house is only big enough for one dog. So if you don’t want to be in the dog-house, make sure somebody else is. I first heard this applied to family situations (specifically to someone’s relationship with his mother-in-law) but it seems more generally applicable.

Mick's Principle of Centrally Managed Economies

There are three reasons why centrally managed economies don't work. The first is obvious, the second less so, and the third not obvious at all. This principle was formulated by a friend of mine during the dying days of the Soviet Union. Its applicability to centrally managed economies is obvious, but it should be borne in mind whenever an organization's success model involves the slightest degree of central planning.

The first problem is that they assume a wise central authority that, given the correct facts, can figure out the right course of action for the next Five Year Plan. It is fairly obvious that such wisdom is unlikely to be found in practice.

The second problem is that even if such a collection of wisdom did exist, it would only succeed if given the correct input. In the case of the Soviet Union, this means the state of production in thousands of factories, mines and so on, as well as the needs in thousands of towns and villages. But all of this input will be distorted at every point.

The lowliest shopfloor supervisor will want to make things look better than they are, while the village mayor will make things look worse so as to get more for his village. And at every step up the chain of management, the information will be distorted to suit someone’s personal or organizational agenda. By the time the Central Planning Committee gets the information about what is supposedly going on, it has been distorted to the point where it is valueless.

The third problem is the least obvious. Suppose that by some miracle an infinitely wise central committee could be found, and that by another miracle it could obtain accurate information. Its carefully formulated Five Year Plan must now be translated into reality through the same organizational chain that amassed the information, down to the same shopfloor supervisor and collective farm manager. At every step the instructions are subject to creative interpretation and being just plain ignored. The Central Tractor Committee, knowing the impossibility of getting parts to make 20,000 tractors, adds an “in principle” to the plan. The farm manager, knowing that his people will never get enough food supplies to live well through the winter, grows an extra hundred tons of corn and stocks it. And so on.

Acknowledgements

Tony Lauck led the Distributed Systems Architecture group at DEC, and was my manager for several years. As a manager he was pretty challenging at times, but as a mentor he was extraordinary. He had (still has, I guess) the most incredible grasp of what you have to do to get complicated systems to work, or perhaps more accurately what you have to avoid doing. At first encounter, spending a whole day arguing over some fraction of the design of a protocol seemed like pedantry in the extreme. It was only later that you came to realise that this is the only way to build complex systems that work, and work under all conditions. With the dissolution of DEC, the “Lauck School of Protocol Design” has become distributed throughout the industry, to the great benefit of all. A whole book could be written about it, citing examples both positive and negative – were it not for the fact that Tony is still very much alive, BGP for example would have him spinning in his grave.

Jac Simensen was my boss (or thereabouts) at DEC for several years. It would be an exaggeration to say he taught me everything I know about management, but he was the first senior manager I saw in action from close-up, and one of the very best managers I’ve ever worked for. He certainly gave me an excellent grounding when I quite unexpectedly found myself managing a group of nearly 100 people, by a long way the biggest group I’d ever led at the time.

Saturday, 12 September 2020

Bread



Today's Loaf

At the start of the shelter-in-place order for the Bay Area I decided to try my hand at making bread. Me, and tens of millions of others. I got started thanks to a friend who gave me a bag of Italian Doppio Zero flour, and thanks also to a small pack of yeast I happened to have. Both ingredients had completely disappeared from supermarket shelves. I found a recipe on the web - which turned out to be seriously flawed. Still, my first effort was pleasant to eat, and encouraged me to keep trying.

Six months have now passed. I've made bread twice every week since then, on Friday and Sunday mornings, which amounts to about 50 loaves. I think that now I've got the hang of it. There are really only two ingredients in bread, flour and water, plus of course yeast. Yet there are amazing variations in what you get with only small changes in the ingredients.

But I'm getting ahead of myself. My two-pound bag of Doppio Zero was quickly exhausted. We had some all-purpose flour, but bread should be made with proper bread flour, which has a higher protein content than normal flour. The protein is what turns into gluten, which is what gives bread its structure and texture. Normally you can buy it in the supermarket, but not in March 2020.

Looking online, I discovered a high-end flour producer (Azure) who claimed to have ten-pound bags of bread flour available. I ordered one, and hoped it would arrive quickly. But it didn't. When I chased them, they assured me it was on its way, but delayed due to the problems arising from the pandemic. That seemed fair enough, but it didn't help me.

I looked some more, and discovered that I could get a fifty-pound sack of flour from King Arthur, the top name in flour in the US. It seemed crazy to buy that much, but it didn't cost all that much and it would solve my problem. I placed the order, intending to cancel the Azure order when the new order shipped.

You can guess what happened. Literally within minutes of the King Arthur confirmation, Azure sent me a shipping notice. The two showed up within a day of each other.

The First Attempt - just wheat flour,
and horribly over-hydrated
I had a few packets of supermarket yeast, but given we couldn't know when bread ingredients would reappear on the shelves, I needed more. Through a sequence of events similar to the flour saga, I ended up with two packs of yeast as well, a total of three pounds - enough for about 160 loaves. On the bright side, it keeps for a long time. Incidentally, the Fleischmann's stuff doesn't make very good bread.

Tricks


A bunch of tricks I've learned along the way...

One thing you quickly discover with bread is the importance of the "hydration", which is to say the amount of water. Too little gives you a very dense bread, while too much delivers decent bread but the dough is a sticky mess that won't hold any kind of shape. I've found 71% works very well, for example 340 ml of water with 480g of flour. This may seem over-precise, but when on occasion I've got sloppy and used an extra 10ml (2%) of water, the dough is really different.

Early on I tried adding hazelnut flour to the normal wheat flour. I add 30g of it to 450g of bread flour. That gives a delicious nuttiness to the taste, and also contributes to the crispness of the crust. I tried walnut flour too. That gives a different taste and less of a crust, but it's interesting too.

At first I tried to knead the bread by hand. It's very satisfying, but it takes a long time and makes your wrists ache. Now I put the flour (and a tiny amount of salt) in the mixer, add the yeast starter, then slowly trickle in the remaining water while the mixer runs. I leave it for ten minutes, occasionally stopping the mixer to scrape the dough off the mixing hook. After that a quick, one minute hand knead finishes everything off and gets the dough to the right texture.

Personally I like bread to have a crisp, crunchy crust. It's tricky to get that to come out right. It all has to do with the way the starches react in the early stages of baking. Industrial bread ovens have a mechanism for injecting copious amounts of steam at the right time. The idea is that in the early stages, the surface is kept moist by steam condensing on the relatively cool dough. This promotes the right reactions in the starch, leading eventually to the Maillard reaction, in which the sugars and proteins at the surface turn into a delicious light brown crust.

Since I don't have an industrial oven, I have to improvise. I put a shallow pie dish of water in the oven when I turn it on. By the time it is at its operating temperature of 500°F (260°C), this is boiling nicely, creating a very humid atmosphere in the oven. Then, when I put the bread in, I empty half the water onto the floor of the oven. This fills it with steam (and generally makes a bit of a mess on the floor too). I leave the pan in the oven for the first five minutes of baking time. When I open the oven to remove it, a hot blast of scalding steam emerges - showing that it has done its job.

I cook hazelnut bread for a total of 29 minutes, 5 with water and the rest without. This results in a perfect, crunchy crust, just beginning to turn deep brown in the darkest places along the top, yet moistly soft inside. Walnut bread does better with a couple of minutes less. Really the goal is to take it out just before it burns.

It has been a challenge to get bread to be the right shape, which for me means roughly circular and 3-4" (8-10 cm) across. If you stretch the dough to the shape you want, it has an annoying tendency to have "memory" and go back to its original shape in its first couple of minutes in the oven. Finally what I have found works is to flatten the dough, as part of the final "knocking back" which removes over-large bubbles. I work on the flattened, pizza-like dough to get it the right length, then fold it over and roll it like a giant sausage roll to get the circular shape.

Even so it happens sometimes that a loaf "explodes" - it develops a big split along one side. This doesn't affect the flavour but it's not very pretty. Cutting slits across the top, half an inch or so apart and quite deep, helps a lot. The other important thing is to make sure the dough joins together properly. Generally I sprinkle flour around when working with dough. That coats the surface and makes it stick less, but it also stops it sticking to itself when you roll it up. A sprinkling of water (not much!) helps, as does massaging the join together.

I generally split off some of the dough to make a couple of rolls. About 80g of dough gives a little roll, perfect for breakfast, with a disproportionate amount of deliciously crunchy crust.

At first I had problems with bread sticking to the baking tray. A piece of parchment paper covering the tray solves that problem. Surprisingly, considering that the ignition temperature of paper is famously "Fahrenheit 451", it chars a little at 500°F but doesn't burn.

Recipe


I use the following ingredients to make a "one pound" loaf:
  • 450g of King Arthur bread flour
  • 30g of ground hazelnut flour
  • a pinch of salt (about 3g - the amount is fairly critical and a matter of personal taste)
  • 8g of yeast
  • 5g of sugar
  • 340ml of water
The water and flour can be adjusted as long as they are in the same proportion.
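
For anyone who wants a bigger or smaller loaf, here is a tiny sketch in Python - my own, using the quantities listed above - that scales everything by the same factor so the hydration stays at about 71%:

# Quantities from the recipe above; scale every ingredient by the same factor.
RECIPE = {"bread_flour_g": 450, "hazelnut_flour_g": 30, "salt_g": 3,
          "yeast_g": 8, "sugar_g": 5, "water_ml": 340}

def scaled(factor: float) -> dict:
    return {name: round(qty * factor, 1) for name, qty in RECIPE.items()}

hydration = RECIPE["water_ml"] / (RECIPE["bread_flour_g"] + RECIPE["hazelnut_flour_g"])
print(f"hydration: {hydration:.0%}")   # about 71%
print(scaled(1.5))                     # a loaf half as big again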

I mix up the yeast and sugar along with 30ml of water and 20g of flour and leave them somewhere warm (around 40°C, 100°F) for 15-30 minutes. That gets the yeast going well. This is mixed in with the remaining solid ingredients and the remaining water prior to kneading.

Since getting up at 4am isn't really my thing, I make the dough the evening before. Once it is kneaded, I leave it to rise for a couple of hours, then put it in the fridge overnight. I generally get up briefly around 6am, and use that to get the dough back out and let it warm back up to room temperature by the time I do the final stages starting some time between 8 and 9. A couple of times I have started too late for that, and left it out overnight. It doesn't seem to make much difference to the final result.

Flour


I was very surprised, a couple of weeks ago, to realise that I was near the end of my fifty pound sack of King Arthur flour. When it ran out I switched to the ten pound bag of Azure. This turned out to give a completely different bread! The Azure flour is grey rather than white. The bread is denser, tastes different, and has a less crunchy crust. Obviously this is all a matter of personal taste, but both of us greatly prefer the King Arthur flour. Now that flour is easy to obtain again, I have bought another ten pounds of King Arthur. That seems to give even better results than the original sack, though I have no idea why.

Sourdough


My friendly English baking neighbour once gave me a sourdough starter. This is supposed to have all kinds of mystical, magical properties. You have to feed it - to the point that if you go away for a few days, you have to arrange with the cat sitter to feed the sourdough as well. There's something very primal about it all, which I think is its attraction.

It also totally failed to work. Luckily some conventional yeast added just before going to bed did work. I was feeling a bit bad about what I'd say to my neighbour. Then she reported the exact same experience.

So much for sourdough.

Sunday, 30 August 2020

Some Network History - Open Systems Interconnection (OSI)

The standards for Open Systems Interconnection (OSI) were a big part of my job from 1980 until 1991. This is a very personal view of what happened, and why it all went wrong.

Background

It's hard to remember now that computers were not always networked together. When you buy a $10 Raspberry Pi, or a $50K server, it's connected to the Internet as soon as you turn it on. Not only can you find cute kitten pictures, but it will load new software and do all sorts of behind-the-scenes things you probably aren't even aware of.

It wasn't always so. In the 1970s, "computer" meant a giant mainframe, typically with a whole building, or a floor of one, to itself. They cost a fortune, and they were self-contained - they didn't need to communicate with anything else. The nearest thing to networking was "Remote Job Entry" (RJE) - typically a card reader and a lineprinter, with a controller, connected over a high-speed data line. High speed as in 9600 bits/sec, or around a ten-thousandth of typical WiFi bandwidth. It would take a long time to load even a single kitten picture at that speed. These were used in places that needed access to the computer, but couldn't justify the cost of one - branch offices, remote buildings on a campus and so on.

Each of the mainframe companies - IBM and the "BUNCH" (Burroughs, Univac, NCR, Control Data and Honeywell) - did RJE their own way. There were no standards or industry agreements, even though they were all doing exactly the same thing. Communication was over a "leased circuit" - a dedicated, and horribly expensive, telephone line directly between the two places. There was nothing that could be called a "network".

The company I worked for, DEC, was the pioneer for smaller computers - minicomputers. These were inexpensive enough that you could have several, which typically needed to share data - for example to run the machines in a factory. For this it had defined its own network architecture, called DECnet, which was the first peer-to-peer commercial network ever. It allowed DEC's VAXes and PDP-11s to communicate with each other, to share files, access applications and various other things.

They also needed to access data held on the mainframe. For this, we wrote software that pretended to be an RJE terminal. To get data, we would send a pretend card deck that ran a job to print the file, then intercept the "lineprinter" output. A similar ruse would send data in the other direction. At one point I was responsible for all these strange "emulation" products. There was one for the IBM 2780 terminal, and one for each of the other mainframe manufacturers. They were a nightmare to maintain, because none of these RJE protocols was documented. They had been worked out by reverse engineering the messages over the data link. So we were constantly running into special cases that the original code didn't know about.

X.25 - The First "Open" Networking

The first inkling of something better came along in the mid-70s. The world's phone companies - at that time still nationalised "PTT"s - had got together through CCITT, their standards body, and come up with something called X.25. This allowed computers to connect just like on the telephone or telex networks. No prior arrangement was needed, you just sent a message which was the equivalent of dialing a phone call, and then you could send and receive data.

My first networking job at DEC, in 1979, was to implement X.25 for the PDP-11 and the VAX. Just a few countries had networks - the UK, France, Germany, and the US, which had two incompatible ones. Although there was a "standard", it had so many options and variations that every network was different and needed its own variant of the software. It was also expensive to use, with a charge for every single byte of data. Getting a connection was a challenge, since the whole concept was such a novelty for the behemoth monopoly PTT organisations.

Apart from the technical difficulties of X.25, there was a much more fundamental problem. As one industry wit put it at the time, "Now I've taught my computers to talk to each other, I find they have nothing to say." There was no standard way to, say, exchange files or log in to a remote computer. Manufacturers could write their own, but that defeated the object of the "open" network in the first place.

There were a couple of efforts to improve this situation. In the US the Arpanet had been funded by the government in 1969, to connect research and government laboratories. It was this that ultimately led to the Internet, but that was a long way off in 1980. There was a similar effort in the UK, led by the universities, to develop standard protocols for common tasks. Each one was published with a different colour cover, so they were called the "Colour Book Protocols".

OSI is Invented

Having a different standard in every country wasn't a great idea either. International standards for all kinds of things have been produced by the International Standards Organization (ISO) since its creation in 1947 - everything from railway equipment to film standards (the ISO film speed for example). Their work included computers. ISO 646, also known as ASCII, was the first standard for character codes. It was the obvious place to put together standards that would be accepted world wide.

The effort needed a name, and "Open Systems Interconnection" (OSI) was selected. 

By then, the concept of protocol "layers" was well established. X.25 had three layers: the physical layer that dealt with how bits were sent across the wire; layer 2 (data link) that got data reliably across a single connection; and layer 3 (network) that took it through the network via what are now called routers. The first task of the ISO effort was to come up with a formal model of protocol layering. This is probably the only piece of the effort that anyone has still heard of, the "seven layer model" published in 1979 as ISO 7498.

The first four layers of the model - as described above, plus the "transport" layer 4 - were already well accepted and not controversial, though the details of their implementation certainly were. The last three layers were however more or less invented out of nothing and weren't aligned at all with the way application protocols were built, then or now.

The "session layer" (layer 5) was conceptually imported from IBM's SNA architecture, though all the details were completely different. It was extremely complicated, reflecting things like the need to control half-duplex (one direction at a time) modems. There wasn't a single application protocol that used it to do anything except simple pass through.

The presentation layer's overall goals were never very clear. What it turned into was a universal notation for data types and their encoding, called ASN.1. It was useful, in that it allowed message formats and such to be expressed in terms of datatypes rather than byte layouts. But it was vastly overcomplicated for what it did.

The OSI Transport Protocol

My own involvement with OSI started in 1980. Definition of the OSI transport protocol was taking place in an obscure Geneva-based group called ECMA. DEC wanted to be involved, and sent me along. My first meeting was at the Hotel La Pérouse in Nice. The work was already well advanced. To call it a dogs' breakfast would be a big disservice to both dogs and breakfasts. There were groups who thought the transport protocol should rely entirely on the network for reliability, and others who thought it should be able to recover from a limited class of errors. Other arcane distinctions, including the need for alignment with CCITT - the telcos' standards club - meant that it had no fewer than four separate "classes", which in reality were distinct protocols having no more in common than a few parts of the encoding.

My task was to add a fifth. All of the work so far was intended to work in conjunction with X.25, which provided a "reliable" network service. If you sent a packet it would be delivered or, exceptionally, the network could tell you that it had been unable to deliver something. It would never (in theory anyway) just drop a packet without telling you, nor misorder them. DECnet, as well as the emerging Arpanet, made a different assumption. They kept the network layer as simple as possible, and relied on the transport layer to detect anything that went wrong, and fix it. That meant a more complex transport protocol. This incidentally is how the Internet works, with TCP as the transport protocol.

I spent the next 18 months designing the "Class 4 Transport Protocol" (the others were numbered from 0 to 3, don't ask), TP4 for short. It worked in just the same way as DECnet's equivalent protocol, NSP, and as TCP, but the encoding had to be compatible, as far as possible, with the other classes, even though the operation was completely different. Practically speaking, a complete implementation of the OSI transport protocol required five completely separate protocol implementations.

I got a lot of guidance and help within DEC, but at ECMA and later ISO I was on my own. Nobody else cared about TP4, nor understood it. That suited me perfectly. It was published in 1981 as ECMA-72.

Maybe because I was really the only one doing any technical work in the group, when the current chair was moved on to another project by his company, I was asked to take that on. It was quite an honour - I was only 28, in the world of standards which (as in politics) tends to be dominated by people towards the end of their careers. That also meant that I got to attend ISO meetings, representing ECMA, the beginning of a long involvement. 

ISO adopted the ECMA proposal for the transport protocol, all five incompatible classes of it, without any technical changes. It was later published as ISO 8073.

Around this time I took up DEC's offer to move to the US for a while, to lead a team building software to connect to IBM systems using their SNA architecture. At least, that was what I was told. In reality, they already had someone for the job, and I was just backup. That gave me plenty of time to work with the network architecture team there, the people responsible for the design of DECnet. The team was really smart and had a big influence on my career, at DEC and subsequently.

ISO meetings were held all around the world, hosted by the various national standards bodies (like BSI, ANSI and AFNOR) and their industry members like IBM and DEC. In those early days I went to meetings in Paris, London, California, Washington DC, Tokyo and others. 

The day before the California meeting, in Newport Beach, we had a very hush-hush meeting at DEC. It was the only time I was in the same room as the CEO and founder, Ken Olsen, along with our genius CTO, Gordon Bell, and our head of standards. The occasion was a meeting with the CEO of ICL, the British computer company which was still important then, and a high-powered team on his side. ICL was convinced that IBM was trying to take over computer networking and impose SNA on the world. That would be a disaster for us, since SNA was very firmly oriented to the mainframe world and not designed for peer-to-peer computing at all. Ken was readily convinced that salvation lay in the creation of international standards that IBM would be obliged to follow, which is to say OSI.

This completely transformed my role in things. Until then, my standards work had been an interesting diversion, the kind of thing that large companies do pro bono for the good of the industry. I thoroughly enjoyed it but nobody at DEC really cared much. Suddenly, it was a key element of the company's strategy, with me and a handful of others at its heart.

In 1983 something extraordinary happened. We were invited by China to have our meeting there, the first international technical meeting that China ever hosted. That meeting, in Tianjin, deserves its own article.

The OSI Network Layer

Shortly after the Tianjin meeting there was a shake-up in the way the various working committees were structured, which left the chair of the network layer group (SC6/WG2) open. This was by far the most complex area of OSI. The meetings were routinely attended by nearly 100 people. It was also extremely controversial, and from DEC's point of view the most important area. I was astounded when I was asked if I'd be willing to chair it. I later learned some of the negotiations behind this from Gary Robinson, for many years DEC's head of standards and an extremely wily political operator. (He was responsible for the tricky compromises that allowed Ethernet and other LAN standards to go ahead despite enormous fundamental disagreement - Token Ring and Token Bus were still very much alive). In essence, the other possible candidates, all much more qualified and experienced than me, had too many enemies. I hadn't yet made any, so I became chair of what was officially ISO/IEC JTC1/SC6/WG2, the OSI network layer group, and went on to acquire plenty of my own enemies.

The problem with the network layer was a complete schism between the circuit view of things and the packet view. The telcos had built X.25, at great expense, and saw that as the model for the network. The user of the network established a "connection", and packets were delivered tidily and in order across the connection. The packet view, which included DEC, was that the network could only be trusted to deliver packets, and then not reliably, and should make no effort to do any more. It could safely be left to the transport layer to fix up the resulting errors.

In OSI-speak, these were respectively the "connection-oriented network service", or CONS, and the "connectionless network service", or CLNS. By the time I arrived there had already been years of debate and architectural hypothesis about how to somehow combine these two views. This had generated one of the most incomprehensible "standard" documents of all time, the "Internal Organisation of the Network Layer" (IONL, ISO 8648). The dust was just about beginning to settle on the only way forward, which was to allow the two to progress in parallel. There was no compromise possible.

The telcos hated this, because it pushed their precious X.25 networks down into a subsidiary role underneath a universal packet protocol, making all of their expensively engineered reliability features unnecessary. From our (DEC) view, this was far better than the complex engineering required to somehow stitch together an "internet" from a sequence of connections. Building a network router is hard enough. There's no need, or point, to make it even harder.

So by the time I was in charge of things, we had two parallel efforts. The CLNS side was led almost entirely by DEC, with excellent support from others in the US. As a result we were able to make rapid progress. We came up with a relatively simple protocol with no options, no variants, and none of the other horrors that bedevilled OSI. It was standardized as ISO 8473, the Connectionless Network Protocol (CLNP).

As chair, I had a duty to be non-partisan. On the other hand, I had no duty to actively help the CONS camp. Between the complexity of X.25, the additional complexity of trying to use it as an internet protocol, and internal divisions within the camp, they had little chance of success. After years of work they never did come up with anything that could be built.

That said, this schism did enormous damage to OSI, and was a major factor in its ultimate demise. To us at DEC it was obvious that CONS was a doomed sideshow, but to an observer it just showed a complete inability to make decisions or come up with something that could be built.

DECnet-OSI

That really highlights the basic flaw of the OSI process. Creating complex technology in a committee just doesn't work. It's hard enough to get a network architecture right, without having to embody delicate political compromises in every aspect of the design. Successful standards like TCP, IP and HTTP/HTML were designed by a single person or a small group under strong leadership. Where possible, we did the same thing at DEC. For example the routing protocol for OSI, universally called "IS-IS", was developed by a small team at DEC, and it still works. With modifications to support IP as well as OSI, it is still used by many of the world's large telcos. We managed to get that through the OSI process with hardly any changes.

At DEC we had whole-heartedly adopted OSI as the future of networking. DECnet, our very successful networking system, was rebranded DECnet-OSI and was to be completely restructured to use the OSI protocols. We even persuaded James Martin, a well-known author of IBM-oriented textbooks, to write a book about it. That probably deserves its own article too. As it turned out, DECnet-OSI never really happened. That was more to do with internal engineering execution problems than with OSI itself, since we carefully picked only the bits that could be made to work.

The OSI Transaction Processing Protocol (or not)

In 1987 I got involved in another part of OSI. IBM had never really tried to influence the OSI lower layers or to make them like SNA. But suddenly they came up with the idea of imposing SNA on the upper layers. SNA had a very complex upper layer structure, mostly oriented around traditional mainframe networking like remote job entry. But they had finally woken up to peer-to-peer networking and added something called LU6.2 to support it. Their idea was to make LU6.2 an integral part of OSI, so that all applications of OSI would in effect be SNA applications. It was a good idea from their point of view, and was very strongly supported by senior management there.

We knew this was coming because of the way ISO works. It started as a "club" of the national standards bodies, and to a large degree still is. This means that proposals can't be submitted directly to ISO, they have to pass through a national standards body - or at least, they did at the time, things have changed a bit since then.

The question was, what to do about it? IBM were heavily constrained by the existing standards and projects. If they had come along with this five years earlier, it would have been much harder to stop, but now they had to find an empty spot they could introduce it to. This they did, under the guise of "transaction processing". So at the 1987 meeting in Tokyo, there was a "New Work Item" for transaction processing, as another application layer standard. To this was attached all of the IBM contributions, which is to say LU6.2 warmed over.

I got a call about a month before the meeting from DEC's CTO, saying, "John, we need you to go and stop this." In the standards process it is almost impossible to stop anything. Once a piece of work is under way, it will continue. Actually terminating a project or committee is virtually impossible. Typically committees continue to meet for years after they no longer serve any useful purpose. So if you want to stop something, you have to either divert it into something harmless, or ensure that it makes no progress.

An experienced chair knows that there are some people who, while working with the very best of intentions, will just about guarantee that nothing ever emerges. It's just the way they're made. I have had the good fortune to know several. You may ask, why "good" fortune? The answer is that if you don't want something to work out, you arrange for them to be put in charge of it. I couldn't possibly say whether something like this may have influenced the failure of the CONS work to deliver.

For IBM's LU6.2 proposal, though, this would not work. They had put some technically strong people from their network engineering centre in La Gaude, France in charge of it. In truth I had little idea what I would do until I got to the meeting. It turned out that there were three camps:

  • IBM and others who liked the idea of LU6.2 being part of OSI
  • Those who thought that making it part of the standard would act against IBM's interests, by making it easier to compete with them. While these people were "enemies of IBM" and in some sense on the same side as me, as far as this meeting was concerned, they were my opponents. For example, France's Bull was in this camp.
  • Those who didn't want it. This turned out to be just me, and ICL.
So I was hardly in a position of strength. In addition, I hadn't been able to make any official contribution to the meeting ahead of time. On the other hand, the people IBM had sent knew little about OSI and the way the upper layers had evolved. They seemed to believe they could do as they had, for example, with Token Ring (and as DEC and Xerox had with Ethernet as well) - just show up with a spec and get it approved as a standard. But things had already gone way too far for that. There were already too many bits and pieces of protocols and services defined.

This was their Achilles' Heel. In the end it was remarkably easy to divert the activity to a study of the requirements for transaction processing (and it turned out there weren't any), and how they could best be met with existing OSI work. Only then would extensions be studied. This was instant death to the idea of just sticking an OSI rubber stamp on LU6.2.

That all makes it sound very easy, though. I was on my own against a large group of people who all wanted me to fail. It was one of the toughest things I've ever done. Luckily there were a lot of DEC people and other friends in other parts of the meeting, so the evenings and weekend were very enjoyable as usual.

There was one person at the meeting who genuinely frightened me. He was incredibly rude and aggressive during the formal meeting, to the point where it became very personal. It was a ten minute walk from the meeting place, just opposite the Tokyo Tower, to our usual hotel, the Shiba Park. I spent those ten minutes looking over my shoulder to be sure he wasn't following me.

That had an interesting consequence. The head of the US delegation was from IBM, and very much of the old school. He was close to retirement and, like most standards people of that era, very much a gentleman. A few weeks later, I was invited, along with DEC's head of standards, to a meeting at IBM's office in New York City. There the IBM guy apologised profusely, and very professionally, on behalf of both IBM and the United States - even though the person in question didn't work for IBM.

I don't exactly remember what happened after that meeting, but I think IBM just quietly dropped the idea and it faded away.

OSI Management

DECnet had powerful remote management capabilities, essential in a networked environment. We knew that if OSI was to be useful, it had to have the same. There was a management activity but for years it had been very academic and gone nowhere. There were some smart people in the UK who wanted management to work too, and between us we came up with everything required: a protocol, and a formal way to specify the metadata. In the end it never got implemented, because OSI was already struggling by the time it was ready. But it was a nice piece of work. It also got me to several interesting places I would otherwise have had no reason to go to.

Why Did OSI Fail?


My final OSI meeting was in 1991, in San Diego. By then I had moved to a new job in the company and was no longer involved with the DECnet architecture. In any case the writing was on the wall: the OSI concept would happen, but it would happen through the Internet protocol suite under development in the IETF. DEC officially made the change shortly afterwards.

Why was OSI such a total failure? It was the work of hundreds of network experts, many of whom really were the top people in their fields. Yet hardly a single trace of it remains. On the other hand the concept of universal computer interconnection has been a huge success, way beyond the dreams of the OSI founders. All they hoped for was the possibility of open communication, they didn't expect it to be a constant feature of the way we use computers. The only thing is, this is all done using the protocols developed by the IETF and loosely called TCP/IP.

OSI was way too complex, with too many options and choices. It was a nightmare to implement, made worse because this was before open source caught on. Some companies tried to make a living selling complete OSI protocol stacks, but that was never really a success. At DEC we had a full OSI implementation several years before DECnet-OSI, but hardly anyone bought it - only a few academic and research users.

I think the main reason was that there was no compelling use case. That seems hard to believe now, but in 1990 it was a chicken and egg situation - until the connectivity was available, there was no use for it. My old boss at DEC said the main reason TCP/IP took over was that Sun was shipping it as part of their BSD-based software, and it was just there, free and available. Because of that, people started to find uses for it. That also happened to coincide with the invention of the World Wide Web in 1990. It was only a minuscule shadow of what it has become, but was a reason to be connected.

By 1995 it was obvious that the future of networking lay with the IETF and TCP/IP. In Europe there were still efforts to keep OSI alive, but without manufacturer support they went nowhere. Around 1997 I was paid to write a study of why the IETF had been so much more successful than ISO. The simple answer is that while IETF is a committee, or actually a collection of numerous committees, each individual standard is produced by at most two or three people. It is then discussed and may get modified, but it is not "design by committee". That is less true now than it was in 1995 - all organisations tend to become sclerotic with age. But back then its motto was "rough consensus and working code". It got stuff done.

Conclusion


From a personal point of view, OSI was one of the most interesting things I've ever done. It taught me a great deal about how to lead in situations where you have absolutely no official authority. It took me on many, many journeys to fascinating places around the world. It also provided my introduction to the woman who would later be my life partner, though that isn't part of this story.

It can be endlessly debated whether OSI was a complete waste of time and effort, or whether it postponed open networking long enough for IBM's SNA to lose its predominant role, making room for TCP/IP. We will never know.

Thursday, 13 August 2020

The Doing Nothing Contract, or How Not to Run Large Projects

 Soon after I left DEC, in 1995, I got involved in what would have been the biggest Systems Integration (SI) project they had ever done. This is the story of the project.

I joined DEC when I left university, 20 years earlier. It was a fantastic place for an engineer to work, and I enjoyed nearly every day I worked there. But in 1995 it was obviously going downhill - it was acquired by Compaq in 1998, which in turn became part of HP - and I found a way to make a decent exit. While I was looking for another job, I started a consultancy business which turned out to keep me busy for the next four years.

About a year later I ran into a former colleague on a plane. He told me about a project that they were working on for a major European telco. It was going to be huge. Did I know anyone who might be able to help? Well, yes, there was me. I think he knew that and was just being polite. I did point out that my daily rate was over double what DEC would normally pay. This wasn't a problem, he said, because they wanted to assemble an elite team of top-level architects to get the overall design right. Within a week I had a purchase order for three months of my time, 40 hours per week, at my usual high daily rate.

The following Monday I showed up at the DEC office in Reading, England. There were two other people in the "elite" team. One was Dave, who I knew quite well - like me, he had already spent 20 years as an employee.

The project was very interesting, on an oft-repeated theme. Telephone networks have always been built using proprietary systems from specialized suppliers like (then) GEC and Alcatel, costing at least ten times more than normal contemporary computers. The client had figured out that this was just a big distributed computing application, and wanted to run their national telephone and data network using off the shelf computer hardware and, as far as possible, software.

It seems like a wonderful idea, but it has been tried several times and so far has never really worked (which I guess is a spoiler for this article). The problem is that, even now and certainly 25 years ago, these telcos were used to being almost their suppliers' only customer. They could make incoherent or outrageous demands, confident that their suppliers would have to follow. The price tag reflected this, but the people making the demands - the engineers - weren't the people paying the bills, so those dots never got joined up inside these huge bureaucracies. A typical interaction would go:

Supplier:   the project will use database X and transaction software Y
Telco:       that's no good, we need features P, Q and R that X and Y don't have
Supplier:  we could add those, but it's bespoke engineering and will add (lots) to the cost
Telco:       no, we want off-the-shelf software, we don't want to pay for custom development
Supplier:  X and Y are what's on the shelf; if you want something else, you have to pay for it
Telco:       (utter incomprehension)

In its latest iteration, this has led to the ETSI NFV (Network Function Virtualization) project, which in its 8 years of existence, so far, has yet to deliver an actual functioning network.

Anyway... our mission was to use off the shelf DEC computers and software to build the switching control system at the heart of the network. It isn't really a hard problem. The basis of it is extremely simple: take a number that someone has dialled (this was a while ago) and translate it into a series of simple instructions to the physical switches, like "connect channel 92 of trunk 147 to channel 128 of trunk 256".

The only thing that makes it hard is the scale - this has to work for millions of users and concurrent calls. But even then, none of these actions have to be closely synchronised. It isn't like, say, Facebook, where something you upload needs to become instantly visible to a billion users around the world.
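
To show how simple the basic operation is, here is a toy sketch in Python - the routing table, trunk numbers and channel lists are all invented for illustration, and the real system was of course far more than this:

# Toy illustration only: map a dialled prefix to an outgoing trunk, claim a
# free channel on it, and emit the instruction for the physical switch.
ROUTES = {"0171": 147, "0161": 212}           # dialled prefix -> outgoing trunk
FREE_CHANNELS = {147: [92, 93], 212: [5, 6]}  # trunk -> free channels

def connect(dialled: str, in_trunk: int, in_channel: int) -> str:
    prefix = next(p for p in ROUTES if dialled.startswith(p))
    out_trunk = ROUTES[prefix]
    out_channel = FREE_CHANNELS[out_trunk].pop(0)
    return (f"connect channel {in_channel} of trunk {in_trunk} "
            f"to channel {out_channel} of trunk {out_trunk}")

print(connect("01714960000", in_trunk=256, in_channel=128))
# -> connect channel 128 of trunk 256 to channel 92 of trunk 147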

Within a week, Dave and I had figured out how to put the available software components together and have a working, scalable prototype within a couple of months. Turning that into a production system would be a much bigger job, needing integration to the telco's dozens of management systems, but that was all low risk, routine stuff. We started writing code.

What we had completely failed to take into account, working in our cosy little office, was the DEC bureaucracy. In a much larger open-plan office nearby was an already-large team of project managers, program managers, project documentation specialists, and for all we knew telephone sanitizers as well. They operated in blissful ignorance of any actual technical details, as they came up with the cost estimates that would be at the heart of the formal bid for the project.

DEC has always been thought of as a computer manufacturer, but they had a large and thriving SI business as well. Over the years they had built up a series of procedures and processes for managing these projects. They were mostly pretty small - integrate a driver for a new piece of hardware into an operating system, or build a user interface around a database application. But some were big - tens of person-years - and a few were really big, like our telco project.

So part of the process was to know when the project was too big for the current level of project management. When that happened, the project would get escalated to the next tier of project management. They would bring in their own team to look at the design, the business aspects, the risk, and everything else.

The first thing the new team would do is to multiply all the existing work estimates by two or more, just as a matter of principle because that is nearly always right. (A very successful SI company CEO who I knew years ago always multiplied all engineering estimates by pi. He claimed it worked every time). Then they would add a whole new layer of program managers, project documentation specialists, telephone sanitizers and all the rest. Then they would start looking at the details, invariably resulting in another factor of two or so.

Our project had already been through two such escalations. Realistically this was probably a 50 person-year project, but the estimates were already in the hundreds, maybe 20 times the original figure. As a result it triggered yet another escalation, to the ultimate level, the corporate Large Projects Office (LPO) in Geneva. 

The LPO was to SI what J K Rowling's Dementors were to Hogwarts. Their job was to suck all joy, and possibility of success, out of a project. I have no idea whether they actually delivered any Large Projects, but I doubt it. Within a week they had doubled all the existing estimates and added yet another layer of program management and the rest. The project had now reached a size - approaching 1000 person-years, 10 times any realistic estimate - that just flat-out terrified the country management. A project on this scale, if it went wrong - which was just about guaranteed - could take the whole company down, and certainly result in some very senior people needing to seek new career opportunities.

The whole team was called together in a large meeting room. DEC had decided to no-bid the project. Permanent employees would be reassigned as soon as possible, while all contractors were terminated immediately.

This is where things got surreal for Dave and myself. We went to some project management type and pointed out that there was no provision for termination in the purchase orders we had received. DEC had bought 90 days of my time, and 180 days of Dave's, just as if they had bought a thousand cases of beer.

"That," said the project management type, "is covered by our standard terms and conditions. It is implicit in the purchase order."

"Maybe," we replied, "but what isn't in the contract, isn't in the contract. The only contract we have is the PO. And the PO has no mention of any standard terms and conditions."

They quickly accepted that we had a point. "OK, but in that case you will have to agree to work on any other project for the duration of the contract."

That was fine by us. A week or so later, such a project cropped up. We started reading documents and figuring out what was needed. But a day later we were called into our original project manager's office.

"The new project is refusing to pay your rates. They are much higher than the normal DEC contract rate, and they won't pay. They say we have to pay because we agreed to the abnormal rate. But we're not willing to subsidize other projects. So you must stop work on the project immediately. And you are not to work on anything else either. You are forbidden to work on any project except the one you were originally hired for." And that had been cancelled, there was absolutely nothing to be done for it. So we were forbidden to do any work at all.

It was Dave who came up with the name "The Doing Nothing Contract". I was commuting from Nice at the time, flying to England on Monday and back on Friday. DEC insisted on me being at the office to do nothing - I was not permitted to do nothing from home. That was the beginning of the weirdest few weeks of my professional life. I'd found a hotel just outside town, an idiosyncratic converted farmhouse with low rates and enormous rooms furnished entirely from estate sales of dubious quality.

Dave (who did live fairly locally) and I would roll up to the office around 10, read the news and chat for a while, then around 11.30 go off to a pub for lunch. We'd be back by 2.30, spend another hour or two nattering (no web to surf back then), and go home. I still had plenty of friends from when I lived in Reading, so I never spent an evening on my own. I put on about five pounds during the brief period this lasted.

After a couple of weeks, I got another call from the project manager.

"Look, this is silly." I agreed. "How about if we pay off half the remaining contract, and call it quits?"

That was fine by me, I already had other work lined up and this was just free money. I left that afternoon and didn't return. Dave made the same suggestion, but his contract had a lot longer to run, so they said no.

We stayed in touch. A few days later, they called him in and told him they would simply terminate the contract "in breach", which is to say they would just stop paying him. Dave had been at DEC for years and knew exactly how things worked on the inside.

"But if you do that, I'll sue."

"Sure, yes, off the record, that's what we'd advise."

"And DEC never contests things like that, so you'll just settle, and end up paying the full amount, plus costs."

"That's true. But that will come out of a different budget, not ours."

Even they could see the silliness of all this, though. Soon afterwards they settled with him on the same basis as myself. He walked away with tens of thousands in unexpected cash, and went back to his day job doing IT projects for insurance companies. And that was the end of the Doing Nothing Contract.

Saturday, 1 August 2020

Building Cisco's Japan Development Center

Definitely the most interesting thing I did in my time at Cisco was to start an engineering team in Japan. How that came about is a story in itself.

My job at Cisco had nothing to do with Japan. I ran the router software group, IOS, which was a seriously full time job. It comprised about 500 people, mostly at the corporate HQ in San Jose, California, but plenty spread around the world including the UK, where I was initially hired, India, France, and several locations in the US. My move to the US coincided with a total, absolute hiring freeze after the crash of 2000. Senior management understood that we needed to grow the group, though, so every time some remote acquisition turned out to be surplus to requirements, we would acquire bits of the team. I had people in Colorado, North Carolina, up north in Sonoma County, and sundry individual contributors working from wherever they happened to have been hired.

Senior management was expected to take on various odd tasks that had nothing to do with the day job. One such assignment that came my way was giving the opening keynote speech at the company's customer conference (Cisco Live) in Japan, in 2003.

Back in the 1980s, international standards work had taken me to Japan several times. During the 1990s my wife went there often for the same reason, and I would sometimes tag along. But this was my first visit since before I'd joined Cisco, in 1999. There were things I'd forgotten since my previous visit, like when the airport bus arrived at a hotel and the staff ran to meet it, then stood and bowed as it pulled away. You get used to this, but after a five year interval it surprised me again.

I was greeted very courteously by the head of marketing in Japan, who has since become a very good friend. It's assumed in Japan that foreigners will need their hands held at all times. It takes a lot to convince them that you can safely use the metro and railway system without getting lost. I think it would be a serious loss of face to mislay a visiting Vice President, so even if they believe you they are reluctant to let you try it. Consequently, I had been met at my hotel - the New Otani - and accompanied to the conference location.

I had a carefully prepared presentation - it had never occurred to me to ask for help or any kind of corporate guidance. But it was only in the hour before I gave it that I learned it was "the" keynote for the conference. I made a few hasty changes and it seemed to work OK. I spoke in English but there was simultaneous translation to Japanese. I asked the head translator, a very distinguished Japanese guy in his 50s, how fast I should speak. "About a quarter as fast as your CEO" was his answer. In fact it was very easy to pace myself. Every single person in the audience of hundreds was listening to the translation on headphones. There was enough sound leakage that I could tell when the translator had stopped, and start the next sentence.

That trip led to a couple more to meet Japanese customers, and that in turn led to a fairly surreal activity. We were trying to convince one of the big Japanese operators to switch to Cisco for their core network. Part of this was a technical collaboration around mobile networking for which I was the corporate sponsor. Every three months we would have a meeting, mostly in or around Tokyo though sometimes in the US, with half a dozen people from each side. Their technical team would present what they wanted to do, and how, and our technical team would respond. The two teams totally disagreed, but it didn't matter. At the end the customer's VP and I would both give little speeches saying how impressed we were with the spirit of cooperation and the progress that had been made. And then three months later we would have exactly the same meeting, with exactly the same presentations, and exactly the same speeches. We got some truly excellent Japanese meals out of it.

It was all worth it though. After several years of this, and long after I had left Cisco, we won the business, worth hundreds of millions of dollars.

You'll gather from this that I loved Japan - and still do - and was very happy to have good reasons to go there as often as possible. I got to know the country manager - now sadly no longer with us - quite well. Our only disagreement was over how I should address him. He wanted me to do it the American way, using his first name. In the Japanese culture first names are used only by childhood friends and immediate family, and even then not always. I just couldn't bring myself to do it, and always addressed him in the Japanese way as Kurosawa-san. If we had met in the US, it would have been different.

Over dinner on one trip he told me that it was his dream to have a corporate engineering activity in Japan. It was a constant ding against American suppliers that they did no R&D in the country, and he wanted to counter that. I thought it was a great idea, but when I presented it to my management back in California they practically laughed in my face. It would be expensive, we wouldn't be able to find or hire the right people, it made no sense, and so on. So that was that.

But Kurosawa-san was resourceful, and at some point, knowing that he had the practical support he needed from me, he managed to convince the CEO, John Chambers. That changed everything. Suddenly I was told to make it happen.

The biggest challenge was to find someone to run it. The key to any remote team like this is to find someone who understands the corporate thinking, and who also understands the local culture. Generally this is impossible, which is why so many remote teams fail miserably. I had a very lucky inspiration. One member of the UK team I'd inherited when I joined Cisco was bored and looking for something new. A Norwegian called Ole, he still had some Viking blood in him: one of life's adventurers, always looking for the next Big Thing. It helped a lot that he already knew some of the Cisco Japan team. He was signed up almost before I'd finished asking him.

The other big challenge was to assemble the nucleus of the team. But that turned out to be much easier than I'd expected. In 2004 Cisco's prestige was high and people were keen to be part of its product development team. There were people already working for Cisco Japan, who had taken jobs in support for want of anything better, who were happy to move into engineering. Through personal contacts we found engineers at Japanese companies who were happy to make the change. One of the new team was Japanese but working for Cisco in California. Quickly I had a nucleus who could be trusted to grow the team - although, in the end, it never did grow.

We had to put the team somewhere. Initially we borrowed some space from the country sales operation, in Shinjuku, but the hope was that eventually it would reach a hundred or more people. For that it made sense to think about a location outside Tokyo, which led to our "fact finding" trip to Kanazawa in Ishikawa prefecture, and our amazing lunch with the prefectural governor that I've written about before.

Everything came together in spring 2005. We had a team, we had an office, and we had someone to run it all. I went to Tokyo for three weeks to get it all started, and luckily my wife was able to come with me. Rather than stay in a hotel, we rented a very pleasant apartment in the Aoyama district of Tokyo. It's the closest I have ever come to living in Japan. We shopped for food in the local supermarkets, an interesting experience for my wife who neither speaks Japanese nor can read any of the characters. Most things can be identified from pictures on the labels, but she needed my help to distinguish salt from sugar and flour. We had a wonderful time there, one of the most memorable trips of my life.

For the next year or so, I visited Japan every three months. It led to some complicated itineraries, since I generally combined them with a visit to the team I still had in the UK. Cisco had rented an apartment in Tameike for Ole and his wife, absolutely vast by Japanese standards, a ten minute walk from the New Otani where I stayed on every trip. Each room was bigger than a typical Tokyo apartment. I spent many memorable evenings there, though the next mornings were sometimes a bit hazy. Apart from the team itself, he'd built a "support network" in Japan who helped him and all of us with every aspect of things.

Since I was in Japan so often, I got to know the country sales team well too. I visited several important Japanese customers as "the man from HQ". I would sit there in total incomprehension as the "fireside chat" meeting ran its course, but apparently just my presence made a big difference.

It was important for the team to know their colleagues in California, and we arranged for them all to visit at the same time. The trip happened to coincide with Halloween, and we arranged a fancy-dress party at home. One of the team had traditional Japanese kimono as a hobby, and she had brought a complete outfit with her. She looked amazing, delicate and beautiful in the Japanese tradition, and definitely took first prize by popular acclaim.

I left Cisco about a year later, and Ole decided to return to Europe at the end of his two-year contract. A local manager was hired. But by then Cisco had lurched into much more aggressive expense control, and the planned expansion never happened. The country manager retired, a victim of corporate politics - the destiny of all who reach the senior ranks of Cisco. With no sponsors left, the group lingered on for a surprisingly long time, but in the end a bean-counter somewhere spotted it and its fate was sealed. Some of the engineers returned to non-engineering roles, some moved to the US, and some left the company.

The country manager, who I got to know well, once said to me "When you make a friend in Japan, you make a friend for life." And it's true. Even now, fifteen years later, I still have good friends there who I see whenever I get a chance to visit.

Sunday, 5 July 2020

The Garden Railway: Trouble with Turnouts

The design and construction quality of LGB equipment is astoundingly good. You can leave the trains outside in all weathers with no damage or deterioration, whether from rain, snow, intense heat, or UV radiation. The locomotives will pull heavy, friction-infested trains all day long without complaint. If anything does break, even the tiniest moulding is available as a spare part, albeit at a high cost. The track too is tough as anything - you can step on it without damage, and electrically it works far better than you could reasonably expect.

But nothing is perfect. The one place where their attention to quality seems to have lapsed is the pointwork, those necessary but fiddly places where trains get a choice of direction. They're eye-wateringly expensive - roughly $100 for a new, electrically operated turnout. But with LGB you just have to get used to that. The problem is, they just aren't that well designed. There's a sort of pervasive optimism, a feeling of "it'll be alright on the night", that applies to every aspect of the design: electrical, mechanical and trackholding.

My garden railway currently has a total of 17 LGB turnouts, all electrically operated via my NCE DCC controller. All the ones on the main running lines are 16xxx medium radius. There is a yard with the small, 600mm radius 12xxx turnouts, mostly bought new 20 years ago. The others are a mix of new, at various times over the last 10 years, and some eBay bargains, of which the oldest was probably 40 years old.

Keeping them all in good working order, so the trains run over them smoothly without derailing, jerking, or just coming to a halt, requires constant attention.

Electrical Problems


All of the electrical side of LGB stock suffers from a degree of design optimism. There are simple rubbing contacts everywhere, for example between the pickups, motors and other electrical connections. The wires are made of brass, which slowly forms an insulating oxide layer on the surface, so intermittent electrical problems slowly arise as the trains get older, especially when they live outdoors.

The slider pickups on the locomotives are a case in point. The idea is excellent, but the connection from the slider to the rest of the electrics depends on a fragile spring, wound with wire barely thicker than a human hair. If ever there is a short circuit in the engine, the spring heats up to the point where it loses its temper - which is to say it stops being a spring, so the pickup stops working. It's possible, but very fiddly, to replace the springs, and to make it more complicated a different part is needed depending on the particular locomotive.

This pervasive electrical optimism really strikes hard on the points. The outer running rails are solid brass, connected to the adjacent track by heavy, springy fishplates. No problem there. But the connection to the switch rails - the ones that move - is made by very primitive sliding contacts under the rail. This works way better than it deserves to when the track is new, but as it ages the contacts and the rail itself oxidize, and the force holding it all together weakens. The net result is that trains hesitate or flat-out stop as they are going over the points.

It doesn't help that there is a lot of dead track. The place where the two rails cross - the "crossing" or "frog" depending on your train-speak dialect - would ideally be connected to one rail or the other depending on the point setting. This is difficult to arrange, and LGB didn't try. These sections are made of insulating plastic, meaning that one wheel, at least, stands no chance of picking up power. Four wheel locomotives, like "Shiny", our Wismar railbus, are especially vulnerable - the more wheels the better.

The diverging rails are connected invisibly, under the sleepers, by metal strips that are spot-welded to the running rails. They also are a bit optimistic. On several of my older points these welds have failed, leaving a lengthy piece of rail with no connection.

Underside of a turnout, showing the soldered connecting wires
The solution to all these problems is to make soldered connections between the various pieces of rail. This is a bit daunting since large-section brass rails conduct heat away from the joint area very effectively, and there is an obvious risk of melting the plastic base. To deal with the second problem first - the plastic used for the bases is pretty resilient to soldering. It softens, but doesn't melt, when you heat the rails up. It will melt and burn instantly if you touch it with the hot iron, though.

I've been pretty successful soldering fine wires to the underside of the rails. My technique is:
  • start by cleaning the metal very thoroughly, with a fibreglass "scratch brush", until it is gleaming
  • then cover the joint area in non-corrosive resin flux
  • I use a 50W temperature controlled iron, set to its highest temperature of 425°C, with a substantial chisel-shaped bit about 7mm across - providing plenty of reserve heat
  • hold the iron flat against the rail, holding it as far as possible from the plastic, and hold the solder against the iron - when melted it acts as a heat transfer fluid
  • now hold the iron in place until the rail is hot enough to form a proper joint with the solder. It's easy to see this because the blob of liquid solder suddenly spreads out on the metal
  • now add the wire, then hold it in place with a screwdriver or similar until the solder solidifies again. This will take a while - up to 30 seconds - because of the heat retained by the rail
  • Don't touch anything! - the rail stays painfully hot for a long time afterwards.

Mechanical Problems


In real life, track is held on the sleepers by some kind of spike driven into the wood, which either directly holds the rail, in US practice, or holds a metal plate which in turn presses on the base of the rail, in Europe. (It's different for serious railways, with high speeds and heavy trains, but light and narrow gauge railways work like this). The LGB track provides a good visual impression of this, but it really isn't very strong. The rails are held in place by tiny flaps of soft plastic, less than a millimetre thick. It takes very little to twist and break them.

On normal track this isn't really a problem. The sleepers all support each other, so they aren't subject to high stresses. And even if one does break, the rail is still supported by those around it. Points are a different story though. For example, the very first sleeper, closest to the moving switch rails, is the only one supporting the point motor and the first few inches of rail. It can easily get broken, and when it does, the vertical relation between the fixed rail and the moving one is lost. Trains fall off the track as a result.

It's impossible to repair the track base. What I have found effective is to glue the rail in place on the damaged sleepers, using the remains of the simulated spike. The plastic is something soft and difficult to stick to, but I have found a two-part epoxy that works well, Loctite EA9340. I originally bought it to make some repairs in the kitchen, where prolonged exposure to steam softened regular hardware-store epoxy, but it seems perfect for this too. Another advantage is that it dries to a murky dark green, making it pretty much invisible on the track.

The technique is simple. First get everything as clean as possible. Clean the rail with a fibreglass brush, and swab everything with alcohol. Then mix up some epoxy and make it into a blob around the base of the rail, so it looks like part of the sleeper. If several sleepers are damaged on the same point, do it for all of them.

Sometimes you can't blame LGB. One of my points was hit by a heavy steel ball, from playing French bowls (petanque) in the garden. The rail was badly twisted both horizontally and vertically, and many rail fastenings were broken. After I dismantled it and straightened the rail out, the epoxy worked perfectly to hold the rails in place. The repaired point is back on the layout, and trains pass it without problems.

Trackholding Problems


In Victorian times facing points - ones where the train has a choice of which way it goes - were regarded with horror. Railway designers went to great lengths to avoid them on main lines wherever possible. Where they were unavoidable, they always had facing point locks, which held the switch rails firmly in place while a train passed over them. They were interlocked with the signals, so it was impossible to clear a train to pass over the points unless the locks were in place.

Sadly our LGB points don't have these devices. They are held in place rather feebly by the magnets in the point motors. It's quite common to have a tiny gap between the fixed and moving rails - a fraction of a millimetre, but enough to cause problems. If a flange rides over the sharp end of the rail it can move the rail under it, opening the point and dropping into the gap on the wrong side. The rest of the train inevitably derails when this happens.

I haven't found a really good solution to this. Some point motors work better than others. I had one point that would consistently cause derailments. It was an old one, from eBay, with an older design of point motor. Replacing the latter with a newer motor held the rail in place much more firmly, and solved the problem.

The ideal, in the absence of an actual lock, would be a really firm over-centre spring mechanism, but I can't see an easy way to do this. In any case the force produced by the point motor probably wouldn't be enough to overcome it.

LGB four-wheel carriages and trucks have pivoting axles, to simplify going round the tight 600mm radius curves. Normally these are held at the correct angle by the traction on the coupling, but that doesn't work if the train is being pushed. And sometimes they get stiff. So they will occasionally end up trying to go through a point when the wheels aren't aligned correctly with the track. This makes the above problem a lot worse. It causes another problem, too.

In real life, points have check rails, or guard rails, which ensure the wheels go the right way through the "crossing" or "frog", where the two rails cross. The check rail presses against the back of the wheel and stops it slipping into the wrong, diverging flangeway.

Unfortunately the check rails on LGB points are mostly decorative. They are way too far from the rails to be really effective. Mostly this doesn't seem to matter, but on the three-way point they are not only too far away, but not in the places they need to be. There are so many problems with this item that it deserves an article to itself.

Sunday, 14 June 2020

The Garden Railway: Fabian the Krokodil

You know how it is with eBay when you bid on something. You have the winning bid for days, then in the last ten minutes a bunch of robo-bidders go to work and the item sells for twice what you thought you'd get it for. So I was very pleasantly surprised to get a message telling me that my bid for our latest LGB locomotive had succeeded. I'd been thinking about getting a new train for the garden railway for a little while. I was very tempted by LGB's DR Class 99 2-10-2 steam engine, a huge monster of a thing which I once saw in real life on the Harzquerbahn in the former DDR. But the new production is selling for nearly $2000, and isn't available yet anyway. That seems like a lot for a toy train.

So when I spotted a Rhätische Bahn Krokodil at a very good price, I just couldn't resist. My winning bid was less than a quarter the price of the steam engine. And a few days later he showed up, and the fun started.

The original Rhätische Bahn Krokodil, class Ge 6/6, in a suitably Swiss setting
The real-life Krokodils appeared in 1921. The Rhätische Bahn, which still operates an extensive metre-gauge network in the Swiss Alps, was a pioneer of system-wide electrification. Starting in 1919, and completed by the mid-1920s, the overhead line covered the whole network, providing electricity at 11,000 volts and 16⅔ hertz.

They acquired some extra-powerful locomotives, officially called Class Ge 6/6, to pull the heaviest trains. The design, with a snout at each end holding a giant electric motor, was needed because of the sharp curves on the line. Similar, but larger engines had already been built for the Swiss main line. They had even longer, flatter snouts, and the name Krokodil (meaning, of course, Crocodile) was an obvious choice, applied subsequently to all similar engines. The driver sat in one of the cabs at either end of the central section, enjoying excellent protection in the event of a crash yet also excellent visibility. The centre section held a huge, heavy transformer, to convert the high voltage scraped off the wires by the pantographs into something like 600 volts for the motors.

All of our engines have names - Marcel the diminutive French 0-6-0 who was the subject of my intelligent locomotive experiments, Helmut the beefy Hanomag 0-6-6-0 Mallet, and so on. A few years back we visited Cancun, in Mexico. We had dinner one evening on the terrace of a restaurant overlooking the lagoon. When the kitchen was closing the cook came out and threw a whole chicken into the lagoon, as he softly called "Fabian, Fabian". There was a splash and a scurry in the water. The chicken was gone. Fabian was their semi-tame crocodile who lived in the lagoon and showed up for dinner every evening. Later we could see him gently rocking to the dance music that was playing downstairs. And so Fabian was the obvious name for our Crocodile.

The big challenge with these old LGB locomotives is converting them to run with the DCC control system on my railway. This puts a permanent 18V AC on the rails, superimposed with a control signal that tells each engine what to do. The advantage is that you can have as many trains as you want, some moving and some standing still, without any complicated wiring. A little decoder in each engine understands the control signal and turns the 18V AC into something suitable to drive a DC motor. It's hard to imagine operating a serious-sized railway without it, nowadays. Modern locomotives are designed to connect easily to a decoder, with the electrical pickups from the track wired separately from the motors.

But the old LGB ones, like Fabian, date from the time when the rails were connected directly to the motor, which was operated by a variable DC voltage on the track. The connection is typically buried deep inside the workings, and so it is with Fabian. I approached the open-gearbox surgery with trepidation, after reading web articles about the number of tiny parts that were just waiting for a chance to leap across the room and get lost in the carpet. In the event it wasn't too difficult. The internal connections are made by intricately shaped brass strips that rub against all the right places, and it just took a couple of cuts to separate them. It was also necessary to run an extra wire from each bogie into the main body, so there could be three completely separate circuits: track power, motor, and lighting.

Like all older LGB locos, Fabian has a fearsomely retro-looking circuit board full of randomly placed through-hole components and massive hand-soldered tracks, which operates the lights from the track power. Rather than attempting to reproduce what it does, I prefer just to give it a fake track power feed and leave all the existing light circuits untouched. A little bridge rectifier produces 18V DC, which is then fed via a reversing relay to the circuit board. I have a stock of bistable relays, which remember their last setting mechanically, left over from another project (30 years ago!). All it takes is a couple of diodes feeding the actual motor supply from the decoder into the two relay coils, so the lights reflect the last way the train ran.

At least, that's the theory, and the way it works on my other engines. But for some reason the relay didn't want to cooperate. I think it must be defective, but anyway I ended up building something a lot more complicated, packed into board space that wasn't really available. It was a nightmare to get it to work, and I still don't understand why.

Finally everything worked. Reassembling Fabian's body was an interesting challenge. The mechanical design is ingenious, as always with LGB. To remove the circuit board required removing the back of the driver's cab. That is held in place by the hinge of the opening driver's door. Reassembling that, with the tiny spring that holds it shut, requires about six hands. With only two hands, it can eventually be done with patience and lots of not-in-front-of-the-children language.

Fabian, with his mentor Helmut looking on
Fabian looking very purposeful
Finally Fabian was ready. While he was on the bench, his train had shown up. LGB has a strange policy of constantly changing what is available new. They have made RhB wagons on and off, over the years, so finding a complete train requires a search of eBay, new production, and dealers' old stock. For now I have managed to accumulate three, with another on its way from England. Helmut generously agreed to lend Fabian his bright yellow banana wagon, but with a condition - he also had to take the track cleaner, a rather nondescript affair loaded with logs to help it scrape the oxide layer off the rails. It's surprisingly effective, but hard to pull and inclined to fall off the track. I'm sure Helmut believes he got the better part of the deal.

So now I can watch Fabian touring our garden at his stately speed - in real life he is limited to 55 km/h, even when pulling the optimistically named Glacier Express. Thanks to the wonders of JMRI and the Raspberry Pi Zero W, I can sit and sip my pastis and control him from my iPhone - but that is another story.

Fabian with his Rhätische Bahn train along with the borrowed banana wagon and track cleaner



Saturday, 23 May 2020

Modelling a Pandemic

Summary


Inspired by the observed differences between actual Covid-19 data and the predictions of the classic "SIR" model, I built a detailed pandemic model. It's a person-by-person simulation, for up to 50 million people, rather than a mathematical model. It uses the tricks I've accumulated writing high-performance network software to do this at reasonable speed. The results closely match actual data, and allow control over variables like vaccination, social distancing and self isolation. Skip to the last section to see the results. The source code is available on Github.

Background


Soon after the Covid-19 pandemic started, Mark Handley at UCL in London started a website showing the development in the number of cases and deaths in various countries and places around the world, which I followed with great interest.

Out of curiosity, I put together a simple simulation based on the SIR model of infectious transmission. This is the basis of much of epidemiology, and yet it quickly struck me that the curves it generates didn't correspond at all with Mark's. Using a log/linear scale (i.e. the Y-axis is logarithmic), SIR gives a straight line up until very close to saturation, where nearly all of the population have been infected. This corresponds to the now much-discussed "R0" number, i.e. the number of next-generation victims who will be infected by a single sick person.

Yet Mark's graphs weren't like that. No matter which country or area, nor which policies were being followed there, they all showed a gradual reduction in the slope. This was true for Lombardy and for much less afflicted places. It was true for Spain, which quickly enforced a very strict lockdown, and for Belarus, which never imposed any restrictions at all. Even knowing exactly when lockdowns were put in place, it was difficult or impossible to see an inflection in the curve.

It was easy enough to come up with a mathematical model that described these curves: it is sufficient for the "R0" number to decrease slowly over time. The resulting curves perfectly matched the actual data from Lombardy.

So, why is reality different from this nearly century-old model? It doesn't take long to see an obvious weakness. The math behind the model is simple:

    infected on day n+1 = constant * infected on day n * susceptible on day n

where the constant is closely related to our old friend R0. It's a simple difference equation, which indeed predicts the exponential growth everyone talks about.
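
Just to make the shapes concrete, here is a minimal Python sketch of that day-by-day recurrence, together with the slowly-decaying-R0 variant described above. It is an illustration only, not the actual simulation (which is on Github), and the population size, R0 and decay rate are numbers picked for the example rather than anything fitted to real data.

    # Minimal sketch of the day-by-day recurrence above, plus a variant where
    # R0 decays slowly over time. Illustrative numbers only - this is not the
    # real simulation, and nothing here is fitted to actual data.

    POPULATION = 1_000_000
    GENERATION = 5          # average days for which someone stays infectious

    def run(days, r0=2.5, r0_decay_per_day=0.0):
        susceptible = POPULATION - 1.0
        infected = 1.0
        curve = []
        for day in range(days):
            # "constant * infected * susceptible", scaled to the population
            new_cases = (r0 / GENERATION) * infected * susceptible / POPULATION
            new_cases = min(new_cases, susceptible)
            susceptible -= new_cases
            infected += new_cases - infected / GENERATION   # recoveries
            r0 *= 1.0 - r0_decay_per_day
            curve.append(infected)
        return curve

    classic = run(200)                            # straight line on a log scale
    flattening = run(200, r0_decay_per_day=0.01)  # slope falls away gradually

Plotted with a logarithmic Y-axis, the first run gives the straight-line growth the SIR model predicts; the second bends over progressively, which is much more like the curves on Mark's site.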

But wait a moment. This supposes that every infected person is equally likely to infect every susceptible person. Is this realistic? Suppose Bob, living in New York, gets sick. Alice lives in Los Angeles with her vulnerable elderly mother. She gets sick too. Which of them is more likely to infect Alice's mother? Or Bob's drinking buddy? In fact, the SIR equation only makes sense when applied to relatively intimate groups of people - families or close friends, for example. For larger communities, it also needs to take into account the probability of contact between individuals.

Herd Immunity


Everyone is now familiar with the concept of "herd immunity". If enough people are immune, a single infected person can no longer infect enough for the outbreak to grow. R (not R0 any more) has fallen below 1. It's widely stated that herd immunity needs to be somewhere around 60% to be effective, depending on the value of R0.

But again, this assumes that everyone is equally likely to infect everyone else - that herd immunity will not protect New Yorkers from each other if it has not yet reached the necessary level in Idaho. This obviously makes no sense. That situation may be bad news when an infected New Yorker visits Idaho, but New Yorkers will be protected from each other.

People tend to operate in clusters: families, groups of friends, close colleagues. Within such a cluster, transmission is high: if one person in a nuclear family, or one person in a small office, gets sick, the others will all be heavily exposed and most likely will either get sick or develop an immune response. Hence the admonition, common even before Covid, for sick people to stay home from work.

Once all the people in such a cluster have been exposed, the cluster has a localised form of herd immunity, regardless of what is going on in the wider world. An infected stranger visiting an exposed family, for example, poses no risk.

Thinking along these lines led me to the idea of "fractal herd immunity" - that there can be herd immunity at a local level, or in a larger community, without it applying globally. There is leakage between clusters - the nuclear family goes to visit the parents and cousins, close colleagues are part of a larger company. Friendships especially are "leaky": it's common enough that my friends have friends whom I don't know, or barely know.

Building a Model


I wanted to develop a model to test this idea and see if the results look like reality. I didn't see any mathematical way to do this, so I built a simulation. Each person in the population is simulated, with day-by-day exposure to the infected people around them.

The model creates cities, and populates them according to a realistic distribution - a few big cities, and lots of small ones. Each person is allocated to each of four different clusters: family, friends, work and the local community. The latter corresponds to things like shopping. Clusters are randomly sized; for example, the family cluster can be anything from 1 (people living alone) to 8, with a bias towards smaller sizes.

Influence is a parameter for each cluster. In a family, people are in close proximity. This is taken as an influence of 1. The community cluster has a much smaller influence, but the clusters are a lot bigger. Cluster size is extremely important, because transmission increases as the square of cluster size: more people are exposed to more infected people. This is why large gatherings have been such an effective way to spread Covid, like the religious groups in north-east France and Korea.

Infection has to get between clusters. Partly this is done by grouping them into larger clusters, with reduced influence between the members. So a sick person in one family cluster has a small chance of infecting someone in an adjacent cluster, just as if they visited another part of the family.

Most cluster memberships and relationships are within the same city, but there are also ways for infection to travel between cities. This can be via cluster membership, for example when a family's relations are in another city, or when an office is part of a larger company based elsewhere. It can also be through explicit travel. Each person is randomly assigned a mobility, which is the probability that they will visit another city on any given day.
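
The real code is on Github and is considerably more involved, but a much-simplified Python sketch of the central idea - people grouped into clusters, with one day's exposure computed per cluster - might look something like this. The class names, influence values and the exact exposure formula are my own illustrative choices, not taken from the model itself.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Person:
        state: str = "susceptible"    # susceptible / infected / immune
        mobility: float = 0.01        # chance of visiting another city on a given day

    @dataclass
    class Cluster:
        members: list = field(default_factory=list)
        influence: float = 1.0        # 1.0 for family, much less for the community

    def daily_exposure(cluster, infectiousness):
        """Run one day's transmission within a single cluster."""
        infected = sum(1 for p in cluster.members if p.state == "infected")
        if infected == 0:
            return
        for person in cluster.members:
            if person.state != "susceptible":
                continue
            # Each susceptible member is exposed to every infected member, so
            # the number of exposures grows roughly with the square of the
            # cluster size - which is why large gatherings matter so much.
            p_infect = 1.0 - (1.0 - infectiousness * cluster.influence) ** infected
            if random.random() < p_infect:
                person.state = "infected"

A full day of the simulation then just loops over every family, friendship, work and community cluster, plus whatever travel moves infected people between cities.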

The model has a few higher-level parameters that control it, and a lot of detailed ones that are manipulated by the higher level ones. For example:
  • population: the total number of people - the model can handle 50 million, and gives rapid results for 3 million
  • infectiousness:  the R0 value for the infection
  • auto-immunity: the number of people who, when exposed, will develop immunity without becoming sick or infectious
  • distancing: the extent to which people's behavior is modified by social distancing
  • vaccination: the number of people who are immune at the start due to prior vaccination 

Results


The results can be presented either as a simple graph, showing the number infected and total infected as some parameter is changed, or as a rather fetching animation where each city is shown as a "bubble" gradually changing color as people are infected and either become immune or recover (or die). Here are some examples.

Social Distance

This chart shows the effect of varying "social distance", i.e. reducing interaction with friends, work and the local environment, and reducing travel between cities. Zero means business as usual. 1 is Wuhan style - everyone at home except truly essential workers, no travel. It is modelled by varying the level of interaction between people and the groups they belong to. At distance equal to 1, the family group is unchanged, friends drops to zero, work to 20% (allowing for some essential workers), and travel to 10%. Intermediate values of distance set intermediate values of interaction.

Somewhere around 0.8 or 0.9 is what was achieved in the UK or California, with less travel and contact, but still some, and some people trying to disregard the restrictions altogether. This level very substantially "flattens the peak", as well as reducing the total number infected. The chart also shows that anything less than 0.5 has no impact.
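
For what it's worth, here is a small sketch of how a single distance value could be turned into the per-cluster interaction levels described above. The endpoint values (family unchanged, friends to zero, work to 20%, travel to 10%) are the ones quoted in the text; the linear interpolation in between is my assumption about what "intermediate values" means.

    def interaction_levels(distance):
        """Map social distance (0 = business as usual, 1 = Wuhan-style lockdown)
        to a scaling factor for each kind of contact."""
        def scale(at_full_lockdown):
            # linear interpolation between 1.0 at distance 0 and the
            # full-lockdown value at distance 1 (an assumption on my part)
            return 1.0 - distance * (1.0 - at_full_lockdown)
        return {
            "family":  1.0,          # unchanged even in full lockdown
            "friends": scale(0.0),   # drops to zero
            "work":    scale(0.2),   # 20% left for essential workers
            "travel":  scale(0.1),   # 10% of normal
        }

    # interaction_levels(0.8) -> friends 0.2, work 0.36, travel 0.28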

R0 - the Infection Ratio


This chart shows the effect of different values of R0, the number of people infected by each infectious person (using a log scale for the Y-axis). At 1.0, the infection dies out slowly. In some simulations, depending on the randomness, it lingers for a long while without growing much. At 1.2, the infection grows very, very slowly. This is a consequence of the local immunity effect - with the traditional SIR model, it would grow very quickly. Covid-19 supposedly has R0 in the range 2.5-3.

Vaccination

This chart shows the effect of vaccination. Just 20% vaccination halves the total number of people infected, and the size of the peak. At 50%, the infection is wiped out. This percentage is a product of the number of people vaccinated, and the effectiveness of the vaccine. So if 50% of people get vaccinated, and the vaccine is 80% effective, that corresponds to 0.4 on this chart.
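
In simulation terms this just means marking the corresponding fraction of people as immune before day zero. A tiny sketch, using the numbers from the example above (the function name is mine, not the model's):

    import random

    def apply_vaccination(states, uptake=0.5, efficacy=0.8):
        # Effective protection is uptake * efficacy: 50% uptake of an 80%
        # effective vaccine protects 40% of the population - 0.4 on the chart.
        for i, state in enumerate(states):
            if state == "susceptible" and random.random() < uptake * efficacy:
                states[i] = "immune"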

Watching the Pandemic


The simulation can also show graphically how the pandemic spreads. The picture above is at day 100 with some typical parameters. Each blob is a city (some large cities are drawn on top of their smaller neighbors). The red ring corresponds to current infections, the blue outer circle to those who are still susceptible. The inner green circle is those who are no longer susceptible - recovered, asymptomatically immune, or vaccinated. The small black dot in the middle corresponds to deaths. Clicking here will show the complete evolution of the pandemic (select 1080p for best results).