The curse of branching
Posted on
What was the first programming construction that you have learned? I guess that the "if-else" construction is in your top 3, probably alongside "for". Conditional branches appear absolutely essential in any codebase. It seems impossible to do anything good without them. But they come with a performance trade-off. And you might never have heard about it. Depending on what the hardware believes would be the branch taken (e.g. the "if" or the "else"), the performance of the code can differ dramatically! Wait?! How could the hardware believe in the first place?! And how does it impact performances? The answer is not so obvious. Understanding it requires knowledge of how memory accesses are optimized at the hardware level. Indeed, as described in a previous post, memory accesses are what drive or kill performances of modern day codes. Among the mechanisms implemented in modern hardware is the instruction pipeline. The perfect analogy to understand pipelining is the automotive assembly line. Let's say building a car requires 3 sequential steps, each involving one man for an hour. Building one car will take 3 hours. Now let's say you want to build 2 cars. Will it double the amount of time required? Well, not in an assembly line where each step is performed by a distinct worker! Think about it. Once the first step is done, the first worker can start working on the second car right away! No need to wait for the first car to exit the entire assembly chain. So with a well-designed assembly chain, building the second car would take only 1 hour more. Someone observing the producing rate of the chain would see one car produced per hour. Considering building a single car is worth three hours of work, such a quick production rate would seem magical to anyone not aware of the pipelining technique! Returning to the question of conditional branches... How do they impact the pipeline optimisation technique? In fact, when branching, a choice must be made at the pipeline level. Which instructions and data should be loaded into the pipeline? The code from the "if" section? Or from the "else"? The answer engineers have found is to implement a hardware branch predictor. Yes, you have read it right! There is a circuitry in your computer's motherboard which does prediction about the flow of instructions to execute. And this was done way before AI was a thing! There are multiple prediction algorithms. All simple and smart. Most of them can detect patterns like "This branch is almost always true", "This branch is taken one over two times", etc. All the information required for predictions per branch are stored in the BTB (e.g. Branch Target Buffer). Now, just like anything in a computer, this buffer has a limited size. So if you have a ton of branches in your code, it will struggle a lot and you will certainly get poor performances! Now, what happens after the prediction has been made? Well, the predicted branch would be loaded into the pipeline. At some point, the hardware would finally get the answer to what branch is really taken. From there, two possibilities: Either the correct branch was taken and everything is fine. Or, the wrong branch was taken, and here start the problems. All the instructions that were executed speculatively in the pipeline must be discarded. The pipeline is flushed. Then there is a fresh restart. Going back to the automotive analogy, this is equivalent to throw away all cars in progress, and wait another 3 hours for a car to finally exit the assembly line. A castastrophe in terms of productivity! So what can developers do about that? I realize, writing this post, that it should be broken into parts. Because there is so much to say about branches impact on performance! For the moment, I am just going to give you pointers to the solutions, namely: Branchless programming, and Profile-Guided Optimisation.