Sex, Drugs & Unix


09/23/2004: "Microsoft, GetTickCount(), LAX and the FAA"


Doc wonders out loud at IT Garage: Did the air traffic control center really have a "Microsoft server crash"?

The only apparent source for the TechWorld story is the LA Times story. I suppose, as "The UK's Infrastructure & network knowledge center" (says the slogan), TechWorld felt a need to alert its readers to a danger that the LA Times writers buried down toward the end of their story. I don't know.

So now I'm wondering... Is this "design anomaly" thing grounds for criticism of anything other than the design decisions behind some software? By the LA Times report, the software itself didn't fail, right? This wasn't a blue-screen kind of thing. It was a weird default, rather than a "server crash". In other words, it's something that's also correctible in software as well.

I bring all this up because I'm not sure this is one of those cases where Microsoft deserves the bashing it often (and sometimes deservedly) gets.

Is it?

Maybe.

Here are two stories, both independent of the LA Times writer (both authored by the same guy).

Here is the CNN story (which may or may not be independent) as well as another potentially independent, blog-like write-up.

The issue is likely related to the GetTickCount() function in the Windows API. This function returns the number of milliseconds since the OS was last booted as a 32-bit value, so it can roll over to zero after 49.7 days.

/home/jim> bc -lq
2^32/(86400*1000)
49.71026962962962962962
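
To make the rollover concrete, here is a minimal Python sketch that models a 32-bit millisecond counter. It does not call the Windows API; tick_count() and days_until_rollover() are hypothetical names made up for illustration, and the only assumption carried over from GetTickCount() is the 32-bit width.

```python
# A model of a 32-bit millisecond uptime counter. This is an illustration
# only: it does not call GetTickCount(), it just mimics the 32-bit width.
TICK_MODULUS = 2**32          # number of distinct values a 32-bit counter holds
MS_PER_DAY = 86400 * 1000     # milliseconds in one day


def tick_count(uptime_ms: int) -> int:
    """Return what a 32-bit millisecond counter reads after uptime_ms of uptime."""
    return uptime_ms % TICK_MODULUS


def days_until_rollover() -> float:
    """Return how many days of uptime it takes for the counter to wrap to zero."""
    return TICK_MODULUS / MS_PER_DAY


print(f"rollover after {days_until_rollover():.2f} days")  # 49.71
print(tick_count(TICK_MODULUS - 1))  # 4294967295, the last value before the wrap
print(tick_count(TICK_MODULUS))      # 0, one millisecond later
```

The counter itself keeps working after the wrap; what breaks is any code that assumes the value only ever goes up.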

Although it won't crash, Win2K can have a very similar problem if your application depends on RPC.
----
SYMPTOMS
The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.
CAUSE
This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.
----
rpcss.exe is responsible for Remote Procedure Call services on the local machine. Perhaps the Harris software depends on RPC. Remember that the LA Times story said there was a resource problem ("to prevent data overload"); perhaps this is the source of it.

There are other Microsoft-written Windows apps (even for Win2K) that exhibit similar bugs.

If the RPC server issue (above) is not the culprit, then the likely problem is that the software written by Harris does not handle a rollover of the GetTickCount() function.
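
I have not seen the Harris code, so what follows is purely a hypothetical illustration of how a rollover bug of this kind can arise, not a description of what their software does. The sketch contrasts a naive timeout check with a wrap-safe one on a simulated 32-bit tick counter; all function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF  # keeps arithmetic within 32 bits, like an unsigned counter


def timed_out_naive(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Buggy check: the deadline can exceed the largest value the counter reaches."""
    return now_tick >= start_tick + timeout_ms


def timed_out_wrap_safe(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Wrap-safe check: compute elapsed time modulo 2**32, then compare."""
    elapsed = (now_tick - start_tick) & MASK32
    return elapsed >= timeout_ms


# A timer started 256 ms before the counter wraps, checked 2000 ms later.
start = 0xFFFFFF00
now = (start + 2000) & MASK32  # the counter has wrapped and now reads 1744

print(timed_out_naive(start, now, 1000))      # False: the timeout never fires
print(timed_out_wrap_safe(start, now, 1000))  # True: 2000 ms really have elapsed
```

The wrap-safe version only works for intervals shorter than one full rollover period, which is plenty for ordinary timeouts.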

If so, the poorly written Harris software contains a bug, and the FAA-mandated solution (which was really not that bad a workaround) was to manually reboot the system every 30 days. As a fail-safe, they had a scheduled task to do a reboot on the 49th day, just in case. The system reached the 49th day because of a procedural error.

Still, Harris claims: "The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999."

If I have to reboot a server every month and still meet seven-nines availability, it had better reboot quickly: 0.0000001 of a month is 0.27 seconds, and that assumes the month has 31 days.

/home/jim> bc -lq
31*86400*(1-0.9999999)
.2678400
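
The same arithmetic as a small Python sketch, in case you want to try other availability figures or reboot intervals. The function name downtime_budget_seconds() is made up for this example.

```python
SECONDS_PER_DAY = 86400


def downtime_budget_seconds(availability: float, period_days: float) -> float:
    """Return the seconds of downtime allowed in period_days at the given availability."""
    return period_days * SECONDS_PER_DAY * (1.0 - availability)


# Seven nines over a 31-day month, the case worked out with bc above.
print(f"{downtime_budget_seconds(0.9999999, 31):.4f} seconds")  # 0.2678 seconds

# The same target over the 30-day manual reboot interval.
print(f"{downtime_budget_seconds(0.9999999, 30):.4f} seconds")  # 0.2592 seconds
```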

The real issue is, of course, the culture around Windows that assumes that a reboot is the appropriate fix for software coding errors. The programmers at Microsoft assumed that no machine would stay up 50 days. I've seen busy Unix and Linux boxes that can go years between reboots.