The Compiler Bug

Matt Schellhas
4 min readJun 1, 2024

--

Part 3 of a series of bugs, all answering “What is the worst bug you ever had to track down?”

The Compiler Bug is the story I use most often in interviews. It wasn’t a particularly difficult bug to track down — it was known for years before I ever worked on it. And it wasn’t a particularly difficult bug to fix — a day’s work to implement. I use this story because it sounds impressive, had a lot of impact, and because was from my time at Facebook.

First some context.

While there, I worked on the Hack compiler. Facebook started out as a PHP shop. As they grew, PHP’s crazy bullshit outweighed its ease of use. Since “rewrite all of Facebook” was not a viable solution, they went with the completely reasonable option of building their own programming language.

Hack exists as a migration path. It added support for optional type annotations, allowing Facebook to gradually migrate their gigantic codebase from untyped PHP nonsense to statically typed goodness. The worst library functions were forbidden. The worst runtime semantics were replaced one by one with something better. That effort started years before I arrived, and continued after I left.

Relevant to this story is the language’s type inference algorithm.

In Hack, local variable types are inferred. Since it needed to infer the type of PHP variables, that means that it inherited PHP’s semantics. This is easier to show via an example.

$x = 42;

$x is inferred to be an int. Easy. If you try to pass it to a function that wants a bool then the type-checker complains, as you’d expect.

PHP though allows you to just reuse variables willy-nilly:

$x = 42;
foo($x);
$x = "no.";
bar($x);

As long as foo takes an int and bar takes a string, this will work just fine. Hack will infer the proper types in each scope, and make sure that your code won’t blow up at runtime. More complex code requires more complex type checking:

if ($a == $b){
$x = 42;
} else {
$x = "no.";
}

process($x);

What’s the type of $x after the conditional? Well, it’s the sum typeint|string. Literally int or string. process then needs to accept a base type that can take either type for the type-checker to accept it. Even this scenario is pretty straight-forward to infer. Infer each branch of the conditional, then union the results.

The compiler bug involved the last case — conditionals in a loop:

function foo(int $a): void {

$x = null;

while (true) {
if ($x == $a) {
break;
} else if ($x is null) {
$x = 0;
} else if ($a < 0) {
$x = "stuff";
}
}

foo($x);
}

What is the type of $x after the loop? It could be null, or int, or string. Intuitively, programmers understand that. Computers… are not good at intuition. What the type-checker actually sees is null|(null|int)|(null|string)|(null|(int|string))|(null|(string|int)) . That gets simplified down, but each permutation through the branches yields its own type. Yes, that means the performance of the type-checker gets factorially worse as the number of conditional branches within a loop increases.

That doesn’t really matter for most of the code in the world. Even O(n!) algorithms are fast enough when n is tiny.

The problem is that at Facebook scale, all manner of corner cases exist. Like its permissions system.

Any permissions system is a set of rules. “If the user is an admin, they can see a page.” “If the page is private, and the user is a friend, they can see a page.” “If the user is in France and French law prohibits this sort of content, block it.” That sort of stuff. The sort of stuff that you really, really don’t want to break.

Facebook (at the time) implemented that as a state machine. A start state, a loop to run the machine until it was done, and a whole bunch of conditionals to modify the state in that loop (though fewer than you’d think). It took 15 minutes to type check that one function. So they turned off the type checking… on the stuff you really, really don’t want to break.

Conceptually, the fix was pretty straight-forward. Instead of implementing each state inline and using a state variable to transition between states, I refactored the state machine so that each rule had its own function and state transitions simply became function calls. Since functions in Hack could declare their types, the inference algorithm had (exponentially) less work to do.

Practically… well, it was still code that you really, really didn’t want to break. How was I sure that I wasn’t breaking it? Would the function nesting cause stack overflow? Would the function call overhead degrade performance? After all, this was one of the hottest codepaths in the world — used daily by a few billion people and accounting for millions of dollars of global CPU costs.

Writing the fix took a day. Testing the fix took a month.

Thankfully, Facebook has glorious internal tooling for these things. The new version of the code shadowed the known-good version. A fraction of traffic would run both versions of the code, and any difference in results would be reported. Performance differences between the two versions would be measured across all of the different rules.

In the end, there weren’t any problems. The runtime team did a great job optimizing code so that the function calls were inlined. The state machine wasn’t terribly complex (and definitely not recursive) so stack overflow wasn’t a concern. And of course, I did my part in writing the code correctly.

One day the code just changed to be better and safer, and nobody in the world noticed.

--

--