The code worked differently when the moon was full (2021) (hanselman.com)
144 points by asicsp 86 days ago | 31 comments



In the mid-90s I worked for a small ISV that wrote a GIS that ran on Windows 3.1 (a 32-bit operating system that ran 16-bit processes (that could have 32-bit code segments)).

One of our customers required perfect uptime, which was a problem with the GetTickCount wraparound described here.

Our solution was to run under NuMega SoftICE [1] and, once a month or so, dispatch an engineer to the customer to simultaneously patch the kernel value for the tick count and the expected value in our software (and also clean up various handle-based cruft).

This worked for years. Unfortunately, the engineer in question was also an alcoholic, so a particularly spectacular bender spoiled an approximately three-year uptime.

[1] https://en.wikipedia.org/wiki/SoftICE


>> One of our customers required perfect uptime

In all honesty, these are the kind of clients I politely point at someone else. They seldom have the resources to fund the requirement, and one absurd requirement leads to another.

I'm sure they had good reason, and good on you for pulling it off, but trying to keep Windows 3.1 up for multiple years sounds like a risky task to take on. Who's to say there wasn't another time-sensitive bug in there that kicked in at, say, 365 days?

To be fair, the insane requirements I've come across are just delusional, so they're easy to walk away from.


I've twice chatted with people who work exclusively with customers no one else wants. One just grinds them down into a normal project; the other transparently charges what is reasonable for the unreasonable: $5000 for the first chat, available 24/7. For that money he explains how it is normally done, what that costs, and what it costs to meet the insane requirements. If they choose normality, he doesn't want the job.


We had a CEO who would sign up just about anyone as a customer. He nearly destroyed the company a few years before I joined by underbidding on a contract that they were completely unqualified to fulfill. Software failed and people died (a government inquiry was scathing about the company's unsuitability for the contract but did ultimately exonerate them w.r.t. the fatalities). Lessons were not learned.


Now you have my curiosity piqued. Is the report from the inquiry available to read online?


There's a repo from danluu, collecting such debugging stories [1]. I've had my own peculiar bug: "Script crashes before 10 a.m." [2]

[1] https://github.com/danluu/debugging-stories

[2] https://darekkay.com/blog/script-crashes-before-10/


I came to the comments looking for something like this. Honestly I’d buy a collection of debugging essays.

Does anyone else know of other places to find stories like this?


One book that you might like to read is: If I Only Changed the Software, Why is the Phone on Fire? https://www.amazon.com/gp/product/0750682183/


Interesting bug, but I don’t like the fix.

Most computers have been 64-bit for decades now. It takes the same resources to add/subtract/compare an int64 as an int32. There's no reason to use 32-bit integers for time; when I write C++ I use the GetTickCount64() API introduced in Vista.

The C# equivalent is Environment.TickCount64, but there's no need for it because, unlike many other languages, the C# standard library comes with good support for dates, times, and time intervals. Instead of integers for the elapsedInterval and requiredInterval variables, I prefer the TimeSpan type. It's safer, more readable, and potentially higher resolution, because TimeSpan keeps an int64 count of 100-nanosecond ticks.
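For the C++ side, here is a minimal sketch of what that can look like (Windows only; the function and parameter names are illustrative, not taken from the article):

    // With the 64-bit tick counter a plain subtraction is safe: it would take
    // roughly 584 million years to wrap. Chrono duration types keep the
    // comparison typed instead of juggling raw ints.
    #include <windows.h>
    #include <chrono>

    bool intervalElapsed(ULONGLONG startTickMs, std::chrono::milliseconds required)
    {
        const ULONGLONG nowMs = GetTickCount64();
        const auto elapsed = std::chrono::milliseconds(nowMs - startTickMs);
        return elapsed >= required;
    }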


>Most computers have been 64-bit for decades now.

Desktops and laptops and servers, sure. There are an awful lot of 32-bit processors in industrial controllers.


True, but even in a 32-bit microcontroller program, any hardware tick counts, high-resolution timestamps, or other high-resolution time quantities should be handled using 64-bit integer types, e.g. "int64_t" or "long long".

All modern compilers support such types, so the choice does not require any work for the programmer. Even on 32-bit MCUs, the extra overhead for using 64-bit time quantities is almost always negligible, because normally only additions and subtractions are done with these values. Multiplications and divisions, which are slow on 32-bit MCUs, are used very seldom for time values.

The failure to use such types is a serious programmer mistake, one that has had no excuse for at least three decades.

In microcontrollers it is even easier to always use the right types for time values, because normally there are no backward compatibility concerns.

Many microcontrollers have only small hardware tick counters, e.g. 16-bit counters, but it is easy to extend them in software to 64 bits by having an overflow interrupt that increments the high word of the 64-bit software tick counter (when reading such split software/hardware tick counters, modifications in progress must be detected and the read retried, as for any other lockless data shared between concurrent threads).
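A rough sketch of such a split software/hardware tick counter (the register address, ISR hookup, and names are hypothetical; the real ones depend on the MCU and toolchain):

    #include <stdint.h>

    // Hypothetical memory-mapped 16-bit free-running timer register.
    #define HW_TIMER_COUNT (*(volatile uint16_t *)0x40001000u)

    // Upper bits of the 64-bit tick count, maintained in software.
    static volatile uint64_t g_ticks_high;

    // Overflow interrupt: the 16-bit hardware counter wrapped, so add one
    // full period to the software part.
    void timer_overflow_isr(void)
    {
        g_ticks_high += 0x10000u;
    }

    // Lockless read: if the high part changed while we were sampling, the
    // snapshot may be inconsistent, so retry (as described above).
    uint64_t ticks_now(void)
    {
        uint64_t high;
        uint16_t low;
        do {
            high = g_ticks_high;
            low  = HW_TIMER_COUNT;
        } while (high != g_ticks_high);
        return high | low;
    }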


The bug depends on the system clock, but it's a bit of a stretch to say that it depends on the moon phase.

Unless they were using that system to play NetHack:

https://nethackwiki.com/wiki/Time#Moon_phase_and_date


I mean, they aren't saying that, just that it almost lines up, and if you didn't know any better you might draw that conclusion.


They stated a correlation, not a causation.


Reminds me of this "We can't send mail more than 500 miles" story - https://web.mit.edu/jemorris/humor/500-miles


I thought this was a reference to the old "phase of the moon" bug entry in the Jargon file[0] and it was very confusing to see it marked 2021. Still, good bug!

0: http://www.catb.org/jargon/html/P/phase-of-the-moon.html


Ahh, that's what it reminded me of...


I strongly believe that "having the two fields get initialized to Environment.TickCount" is the correct fix and that using unsigned arithmetic is just a workaround. If you save the time of some event in a field, you either need to initialize that field with the current time or have special handling of "value not set" everywhere it is used; otherwise you're almost guaranteed to have some hard-to-debug problem somewhere in the long run.
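A tiny sketch of that "initialize to now" idea, written here in C++ with made-up names since the article's code is C#:

    #include <chrono>

    struct Connection {
        // Start from the current time, not from a zero/default value, so
        // "time since last activity" is meaningful from the first moment and
        // no "value not set" special case is needed anywhere.
        std::chrono::steady_clock::time_point lastActivity =
            std::chrono::steady_clock::now();
    };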


I agree


“int currentTimeMs” - whoa, I wish the compiler would stop me there.


Yeah, you pretty much need to treat a millisecond timestamp as an opaque value, only using the Int you get when you subtract two of them. Doing that completely avoids the issues from timestamps being negative or wrapping around.
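For example (a sketch with illustrative names):

    #include <cstdint>

    // Only the unsigned difference of two readings is meaningful, and it
    // stays correct across the 32-bit wraparound.
    uint32_t elapsed_ms(uint32_t earlier, uint32_t later)
    {
        return later - earlier;   // modulo 2^32, by definition of unsigned arithmetic
    }

    // e.g. earlier = 0xFFFFFF00 (just before the wrap), later = 0x00000200
    // (just after): later - earlier == 0x300 == 768 ms, the true elapsed time.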

I've seen games that crash if your timestamps were negative. So if your computer has been on for a while, and a game isn't running properly, reboot.


The most recent C++ standards have become extreme with this. A time point from the filesystem is a different type to a time point from the OS, which has a different valid range.
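A minimal C++20 sketch of that (assuming library support for std::chrono::clock_cast, which, as noted further down in the thread, isn't available everywhere yet):

    #include <chrono>
    #include <filesystem>

    void example(const std::filesystem::path& p)
    {
        // A file time point and a system-clock time point are distinct types.
        std::filesystem::file_time_type ft = std::filesystem::last_write_time(p);
        auto sys_now = std::chrono::system_clock::now();

        // auto diff = sys_now - ft;   // won't compile: different clocks
        auto as_sys = std::chrono::clock_cast<std::chrono::system_clock>(ft);
        auto age    = sys_now - as_sys;   // fine: both are system_clock time points
        (void)age;
    }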


I was intrigued by this, and looked up:

https://en.cppreference.com/w/cpp/chrono

Yeah, it is a bit chaotic.

...but, given that there often isn't just one "time point from the OS", and thinking about the different clock types on Linux (CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc. [0]), clocks and calendars are just messy. If you try to simplify them, you miss some edge cases which always end up being important to someone.

I guess there are just a bunch of tradeoffs that need to be made, and no matter which ones anyone makes, a lot of other people would have made different ones.

[0] https://manpages.debian.org/bookworm/manpages-dev/clock_getr...


Following my initial, post-compiler-upgrade disgust, I do think it's a wise tradeoff. I can't use clock_cast yet, so the workarounds are kludgy, but it's the kludge itself that tells you you're dealing with something complex.


Another interesting debugging story - https://news.ycombinator.com/item?id=32709045


currentTimeMs is Environment.TickCount, which in this case happens to be negative.

It's worth noting that GetTickCount() is documented as returning an unsigned count, and unless you really need to time intervals > ~49 days to millisecond accuracy, everything works as expected with modulo arithmetic.
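A sketch of that pattern (Windows; the helper name is mine):

    #include <windows.h>

    // Store the starting tick and compare the unsigned difference against the
    // timeout; DWORD arithmetic is modulo 2^32, so this keeps working across
    // the ~49.7-day wraparound.
    bool timeoutExpired(DWORD startTick, DWORD timeoutMs)
    {
        return (GetTickCount() - startTick) >= timeoutMs;
    }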

Windows' AppVerifier has an option that causes GTC to roll over much sooner, specifically to test for such bugs.


Cool case! Always really interesting to read about other people's debugging cases and strategies. I can also 100% recommend https://www.whyprogramsfail.com/, a book which looks at the theory behind writing correct software.


Someone somewhere had a QA list of things to check/correlate, and one of them was: failure after 49 days.


Better still, start the clock so wrap-around is imminent and inevitable: https://lwn.net/Articles/22874/ ("make jiffies wrap 5 min after boot"). That stops a whole lot of silly failures right there.
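The same trick sketched at the application level (names and numbers are illustrative): start a 32-bit millisecond counter a few minutes shy of wraparound so rollover bugs surface in testing rather than 49 days into production.

    #include <cstdint>

    constexpr uint32_t kMsUntilWrap = 5u * 60u * 1000u;           // wrap ~5 minutes in
    static uint32_t g_tick_ms = UINT32_MAX - kMsUntilWrap + 1u;   // deliberately near the top

    // Called from a hypothetical 1 ms timer callback.
    void on_millisecond_tick(void)
    {
        ++g_tick_ms;   // rolls over to 0 about five minutes after start
    }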


Windows Application Verifier has a feature to test the timer rolling over.


I was hoping this would be a post about some bug that manifested itself in SCADA systems for tidal gauges, or something actually affected by the physical position of the moon.



