The Hamburger Bug

Matt Schellhas
5 min read · May 10, 2024


Part 2 in a series of bug stories, all answering “What is the worst bug you ever had to track down?”

This is the answer I use if I think my audience wants a more involved discussion, or is specifically interested in deep technical knowledge, because this is the hardest bug I’ve ever had to track down, even if it wasn’t the biggest or most valuable.

The Project

In my first team lead role I got to work on a pretty awesome, very greenfield project. It was a digital menu board system for quick-serve restaurants. You know — the handful of flat-screen TVs over the cashier’s head at your local McDonald’s/Dunkin’/whatever, showing you prices while also enticing you to buy that dancing hamburger.

Such things are very common today, but didn’t exist as recently as 25 years ago. I was responsible for the little boxes that sat in each store and displayed the menus to customers. This is basically what I got when I started:

  • Must support 1–8 screens of varying resolution and orientation.
  • Must be able to support Adobe Flash, Windows Media Videos, and a dozen other formats that were already bad and old by then.
  • Must be able to run 24/7 without human intervention, on-site, with slow, intermittent Internet connectivity.
  • Must be able to display content in a statistically rigorous way to support A/B testing, even though not all content is guaranteed to download/display successfully. (This was important, since the sales pitch was that we could algorithmically optimize ad content to make restaurants more money.)
  • You have two engineers in a remote office to help you and a year to do it. Good luck.

What I built was a nicely multithreaded app with different components. One to fetch content. One to generate the playlist based on available content. One for log management. One to track what was actually displayed and return it to a central server. One to actually lay out and render the content. All of them talked to each other via messages across a central thread-safe in-memory message bus.
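
The bus itself was nothing exotic. A minimal sketch of the idea, with names and details that are mine rather than the original code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch only: one handler per message type, no shutdown, no error handling.
public sealed class MessageBus
{
    private readonly BlockingCollection<object> _queue = new BlockingCollection<object>();
    private readonly ConcurrentDictionary<Type, Action<object>> _handlers =
        new ConcurrentDictionary<Type, Action<object>>();

    public void Subscribe<T>(Action<T> handler) =>
        _handlers[typeof(T)] = msg => handler((T)msg);

    public void Publish(object message) => _queue.Add(message);

    // A single pump thread delivers every message, so components only ever
    // talk to the bus, never directly to each other.
    public Task Run() => Task.Run(() =>
    {
        foreach (var msg in _queue.GetConsumingEnumerable())
            if (_handlers.TryGetValue(msg.GetType(), out var handle))
                handle(msg);
    });
}
```

The point of the design was isolation: the downloader could die, the playlist generator could stall, and everyone else just kept consuming messages.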

That design was really great. The renderer itself was a bit hackier.

It was essentially a handful of C# WebBrowserControls running in a WinForms app; one for each screen. The controls in turn each ran a pile of JavaScript. C# would invoke JavaScript functions to move to a new scene and JavaScript would do all of the DOM manipulation to lay out the menu, load it in the background, swap it with the current scene, and tear down the old one. A simple double buffering setup. The JavaScript would make callbacks into C# events for OnError, OnSceneStart, and similar things, creating a feedback loop between the two parts of the renderer.
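
The glue between the two halves looked roughly like this. This is a sketch assuming the standard WinForms interop surface; ScriptingBridge, RendererForm, and renderScene are illustrative names, not the original code:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

// The page's JavaScript calls back through window.external,
// which WinForms wires to whatever object you hand it.
[ComVisible(true)]
public class ScriptingBridge
{
    public event Action<string> SceneStarted;
    public event Action<string> SceneErrored;

    // Called from JavaScript: window.external.OnSceneStart(sceneId)
    public void OnSceneStart(string sceneId) => SceneStarted?.Invoke(sceneId);

    // Called from JavaScript: window.external.OnError(message)
    public void OnError(string message) => SceneErrored?.Invoke(message);
}

public class RendererForm : Form
{
    private readonly WebBrowser _browser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly ScriptingBridge _bridge = new ScriptingBridge();

    public RendererForm()
    {
        _browser.ObjectForScripting = _bridge; // exposes window.external to the page
        Controls.Add(_browser);
    }

    // C# -> JS: ask the page to build the next scene off-screen and swap it in.
    public void ShowScene(string sceneJson)
    {
        _browser.Document.InvokeScript("renderScene", new object[] { sceneJson });
    }
}
```

All of the double buffering lived on the JavaScript side; C# only ever said “go” and listened for the callbacks.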

Even today, it’s one of the coolest things I’ve ever built.

The Bug

It took about six months to get all of that working reliably. Month eight was spent chasing this bug. It started — as most bugs do — with a vague report from QA: “I came in this morning and one of the machines was frozen again”.

I walk over, and sure enough it’s just sitting there on one scene. Alt-tab. Yup, my software was stuck, not the OS. Check the log. No errors. Just normal “scene successful” messages until 4AM, then nothing. I thank the QA person, and ask them to add the log and the playlist definition to the bug report. We restart the app and it happily displays dancing hamburgers for the rest of the day.

“Again” referred to the previous titleholder for Worst Bug I Ever Had To Track Down. A month earlier there were similar symptoms. That one was caused by a race condition where I added the C# event handlers in the line of code after telling JavaScript to render the scene, yet JS managed to do all of that and fire the event before the event handler was in place. One week to diagnose, five seconds to fix.
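
In code, the bug and the fix were about as mundane as it gets (same illustrative names as the sketch above):

```csharp
// Buggy: the script can render the scene and fire its callback
// before the next C# statement ever runs.
browser.Document.InvokeScript("renderScene", new object[] { sceneJson });
bridge.SceneStarted += OnSceneStarted; // too late, some of the time

// Fixed: attach the handler first, then kick off the render.
bridge.SceneStarted += OnSceneStarted;
browser.Document.InvokeScript("renderScene", new object[] { sceneJson });
```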

So I went right to that bit of code to try and reproduce it. Big content, small content. Long playlists, short playlists. All sorts of stress testing to elicit any lingering race conditions. Nothing.

The following week, QA reports similar occurrences. Sometimes one or all screens would be black rather than stuck on one scene. Some nights multiple test machines would get stuck. Some nights none. Occasionally they would get stuck during the day and someone would notice. Different videos, so it wasn’t some sort of corruption there. Different playlists, so it wasn’t just some typo where it was set to play for 10 hours rather than 1. Different machines, so it wasn’t some bad display driver or something.

The only discernible pattern (at first) was that it never happened on a dev machine, regardless of how anyone poked or prodded it and regardless of how long we let the same exact playlists with the same exact content run.

The breakthrough came from the helpful QA person who reported the bug in the first place. “I think I’ve only seen it happen on these machines.”

Okay cool. What’s different about those?

“Leave it running overnight” is not an ideal cycle time for experimentation, but we narrowed it down in steps:

  • The bug only shows up on certain versions of Windows. Fair enough, C# libs and drivers do change subtly from version to version.
  • It requires there to be multiple screens attached. Sure. It’s pretty clearly a concurrency bug of some sort.
  • It only happens when there are WMV files in the playlist. Okay? Some sort of incompatibility there?
  • And it only happens on machines with Internet Explorer 8 installed. What the fuck? Why would a separate install that we don’t even use change how our software behaves?

And that’s how I came to learn how to use WinDbg.
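
For anyone who hasn’t had the pleasure: the trick is to attach to the running process and make the debugger stop on first-chance exceptions instead of letting the process swallow them. A session of that sort looks roughly like this — standard WinDbg commands, reconstructed from memory rather than a transcript:

```
$$ Break on first-chance access violations and C++ exceptions
$$ instead of letting the process handle (and hide) them.
sxe av
sxe eh

$$ Let it run. (Overnight, in our case.)
g

$$ When it finally breaks: call stack, then exception details.
k
!analyze -v
```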

The Root Cause

Debugging through the .NET runtime was fruitless. Many threads, all working as expected.

Digging through core files was fruitless. A normal looking app, with everything running as expected.

And one day I happened to catch it in the debugger as it happened:

The JavaScript engine crashed.

A single exception, visible only in the kernel-level debugger, and then the whole beast slipped silently from memory.

After I recovered from my horror, I began to piece things together:

  • My software wasn’t using IE8, but the WebBrowserControl did use the JavaScript engine that came with it.
  • That particular version of Windows did ship some JavaScript engine enhancements.
  • So some internal JavaScript used by the WMV player, or by the callbacks it triggered, was enough to eventually hit a bit of code in those enhancements that wasn’t quite concurrency safe, killing the engine itself.

I wrote up a nice big bug report for Microsoft, but never heard back. My bosses were dismayed to hear that we couldn’t support Windows Media files, but it was a moot point. Our promised algorithmic ad optimization only managed to increase profits by $2 per store per month. I managed to leave a few months before the entire project was canned as a good idea that didn’t work out.
