Although it didn't make a huge splash, the Air Traffic Control system in Southern California was out of commission for several hours at the end of last month (September, 2004).
According to this post (http://garage.docsearls.com/node/view/459), it appears related to the fact that the 32-bit millisecond tick counter in the Windows OS rolls over every 49.7 days.
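The 49.7-day figure follows directly from the counter width. As a quick check of the arithmetic (the variable names are mine, purely for illustration):

```python
# A 32-bit counter can hold 2**32 distinct values before wrapping to zero.
TICKS_BEFORE_WRAP = 2 ** 32

# One tick per millisecond, so count the milliseconds in a day.
MS_PER_DAY = 24 * 60 * 60 * 1000

rollover_days = TICKS_BEFORE_WRAP / MS_PER_DAY
print(f"{rollover_days:.1f} days")  # prints "49.7 days"
```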
The designers of the system failed to account for the possibility of rollover. That was their first mistake.
For some reason, rather than correct the basic programming problem, the system designers established a manual procedure (reboot the system every 30 days) to work around it, compounding their mistake.
Per Finagle's Law, human error led to a failure to reboot the system before 49.7 days elapsed. There was no second check (manual or [preferably] automated) to verify the procedure was being followed, introducing a single point of failure to the system.
The goal is a system which just runs. In this case, the system designers failed to achieve that goal. The typical "Microsoft Windows is crap" commentary obscures the fact that poor design was at the heart of the matter:
- Failure to account for rollover
- Failure to double-check (preferably automatically) that the manual procedure was being carried out
- Failure to recognize the manual procedure as a single point of failure
I would have really loved to have seen the console warning message, though:
Warning: this Windows system has been running for more than 30 days. Please reboot now.
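For what it's worth, the missing automated second check would not have taken much. Here is a minimal sketch, assuming the uptime in milliseconds is available from some source that does not itself roll over; `reboot_warning` and its parameters are hypothetical names of mine, not anything from the actual system:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def reboot_warning(uptime_ms, limit_days=30):
    """Return a warning string once uptime exceeds limit_days, else None.

    uptime_ms is assumed to come from a counter wide enough not to wrap
    within the period being checked (an assumption of this sketch).
    """
    if uptime_ms > limit_days * MS_PER_DAY:
        return (
            "Warning: this Windows system has been running for more than "
            f"{limit_days} days. Please reboot now."
        )
    return None
```

A scheduled job could call this once a day and page somebody when it returns a message, which would at least back up the manual procedure with a second check.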
Of course, the right answer would be to correctly account for rollover. The exercise is left for the reader.
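For readers who would rather not do the exercise, one common approach is to compare tick values only by their difference, computed modulo the counter size. The sketch below imitates a 32-bit counter in Python; `elapsed_ms` and `has_expired` are illustrative names of mine, and I am not claiming this is how the system in question was eventually fixed:

```python
WRAP = 2 ** 32  # number of distinct values in a 32-bit tick counter


def elapsed_ms(start_tick, now_tick):
    """Milliseconds from start_tick to now_tick, correct across one rollover.

    Taking the difference modulo 2**32 mirrors what unsigned 32-bit
    subtraction does in C: the result wraps the same way the counter does.
    """
    return (now_tick - start_tick) % WRAP


def has_expired(start_tick, now_tick, timeout_ms):
    """True once at least timeout_ms have passed since start_tick."""
    return elapsed_ms(start_tick, now_tick) >= timeout_ms


# The timer started 100 ms before the rollover and is read 50 ms after it.
start = WRAP - 100
now = 50
print(elapsed_ms(start, now))        # prints 150
print(has_expired(start, now, 120))  # prints True
```

The naive test `now >= start + timeout_ms` gets this case wrong, because `now` is numerically smaller than `start` after the wrap. The difference-based version only holds up when the real interval is shorter than one full rollover period (about 49.7 days); anything longer needs a wider counter.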
In the overall context of systems that just run, this is an interesting question:
How would you build a clock that ran for 10,000 years?
Note the consideration (and abandonment) of "human ritual" as one way of keeping the clock wound for 100 centuries.