The horror of regressions! Just about nothing hurts more than putting tremendous effort into resolving a nasty bug, only to have it come back to bite you a few months later. I whined about our Top Four Worst bugs in my August 2011 BOTM post. The one that came back in March was a variant of The Time Bomb. Internally we call this bug “The Watchlist Error,” since the error message you’d get mentions the Injection service being unable to inject the core module specified in its process “watchlist” by BoundsChecker’s launch manager. The truly frustrating thing is that the regression involved zero code changes (so it was not a side effect of any other work), and it survived all of our batteries of regression tests on myriad Windows desktops, servers, and laptops. How could something like this escape into the product again, as it did with our DevPartner Studio 10.5.3 point release?
The nitty-gritty detail of The Time Bomb revolves around a security protection: we digitally sign all our product binaries, and then run an internal check right before injection to ensure the injectable DLL’s signature is valid and authentically made with the Micro Focus certificate. The first Time Bomb went off when an older Compuware-era certificate expired and our security check effectively shut down BoundsChecker on all platforms. This second mini-Time Bomb is more subtle. Our Micro Focus certificate got updated within the DevPartner Studio 10.5.3 time frame, and it remains both valid and proper. However, at the bottom of the certificate chain, the base Microsoft root certificate did not match. The observed misbehavior bubbled up from a relatively small but far-flung set of systems well outside our test lab. We narrowed this minuscule difference down to Windows XP machines that had not taken any Windows Updates. On an updated machine, the root certificate would have been automagically refreshed. However, for machines in labs or shops on private networks, or with a locked-down patch policy frozen to updates, this issue shut down BoundsChecker just as hard as the original Time Bomb.
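To make the two failure modes concrete, here is a toy Python model of a chain check. It is only a sketch under loud assumptions: the `Cert` class, the `check_chain` function, and every certificate name and date are hypothetical, and it does no real cryptography. It is not the check BoundsChecker actually performs. It only shows how a perfectly valid signing certificate can still fail when the machine's store of trusted roots is stale.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Set


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str      # subject of the certificate that signed this one
    not_after: date  # expiry date


def check_chain(leaf: Cert, known: Dict[str, Cert],
                trusted_roots: Set[str], today: date) -> str:
    """Walk from the leaf up to its root and report the first problem found."""
    cert = leaf
    seen = set()
    while True:
        if today > cert.not_after:
            return f"expired: {cert.subject}"
        if cert.subject == cert.issuer:  # self-signed, so this is the root
            if cert.subject in trusted_roots:
                return "ok"
            return f"untrusted root: {cert.subject}"
        if cert.subject in seen or cert.issuer not in known:
            return f"broken chain at: {cert.subject}"
        seen.add(cert.subject)
        cert = known[cert.issuer]


# Hypothetical certificates, invented for illustration only.
old_root = Cert("Root A", "Root A", date(2020, 1, 1))
new_root = Cert("Root B", "Root B", date(2030, 1, 1))
signer = Cert("Product Signer", "Root B", date(2014, 1, 1))
known = {c.subject: c for c in (old_root, new_root, signer)}

updated_machine = {"Root A", "Root B"}  # root store refreshed by updates
frozen_machine = {"Root A"}             # isolated box, never updated

# Same binary, same signer, different outcome per machine.
on_updated = check_chain(signer, known, updated_machine, date(2012, 3, 1))
on_frozen = check_chain(signer, known, frozen_machine, date(2012, 3, 1))
# The first Time Bomb: the signing certificate itself has expired.
after_expiry = check_chain(signer, known, updated_machine, date(2015, 1, 1))

print(on_updated, on_frozen, after_expiry, sep="\n")
```

In this model the expiry case fails everywhere at once, which is why it is hard to miss. The stale-root case fails only on the machine whose trust store never changed, which is the kind of system a well-patched test lab does not contain.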
How did we miss this in our regression tests? All our test systems are on the same network trunk and pull the same Windows updates. Even though we mix XP SP3 and newer Windows versions into our platform tests, none of them, whether physical or virtual machines, would have been off on a private isolated trunk. How will we prevent this ever again? If I had a dime for every time we made a change and declared “this shall never happen again” I could retire early to a beach house on Aruba. Our only saving grace is that our ability to trap and deploy updates quickly meant that DevPartner Studio 10.6 got fixed before RTM, and that we had a patch build for 10.5.3 available to anyone facing this regression within a few days of identifying the root cause.
I already know what April’s bug will be. It’s a kernel mode fix that has been refixed at least three dozen times. I guess I’ll have a few more dimes to collect to add to my Aruba fund.