Faster C++

The curse of branching


What was the first programming construct you learned? I would guess that "if-else" is in your top 3, probably alongside "for". Conditional branches seem absolutely essential in any codebase; it looks impossible to do anything useful without them. But they come with a performance trade-off, and you might never have heard about it. Depending on which branch the hardware believes will be taken (i.e. the "if" or the "else"), the performance of the code can differ dramatically!

Wait, how can the hardware "believe" anything in the first place? And how does that impact performance? The answer is not obvious. Understanding it requires knowing how memory accesses are optimised at the hardware level. Indeed, as described in a previous post, memory accesses are what make or break the performance of modern code. Among the mechanisms implemented in modern hardware is the instruction pipeline.

The perfect analogy for understanding pipelining is the automotive assembly line. Let's say building a car requires 3 sequential steps, each involving one worker for an hour. Building one car will take 3 hours. Now let's say you want to build 2 cars. Will that double the time required? Well, not in an assembly line where each step is performed by a distinct worker! Think about it: once the first step is done, the first worker can start on the second car right away. There is no need to wait for the first car to exit the entire assembly line. So with a well-designed assembly line, building the second car takes only 1 more hour. Someone observing the output of the line would see one car produced per hour. Considering that a single car represents three hours of work, such a production rate would seem magical to anyone unaware of the pipelining technique!

Returning to the question of conditional branches: how do they affect this pipelining technique? In fact, when the code branches, a choice must be made at the pipeline level.
Which instructions and data should be loaded into the pipeline? The code from the "if" section, or from the "else"? The answer engineers have found is to implement a hardware branch predictor. Yes, you read that right: there is circuitry inside your CPU which makes predictions about the flow of instructions to execute. And this was done way before AI was a thing! There are multiple prediction algorithms, all simple and smart. Most of them can detect patterns like "this branch is almost always taken" or "this branch is taken every other time". The per-branch information required for these predictions is stored in the BTB (Branch Target Buffer). Now, just like anything in a computer, this buffer has a limited size. So if your code has a huge number of hot branches, the predictor will struggle a lot and you will certainly get poor performance!

Now, what happens after the prediction has been made? Well, the predicted branch is loaded into the pipeline. At some point, the hardware finally learns which branch is really taken. From there, two possibilities: either the correct branch was predicted and everything is fine, or the wrong branch was taken, and the problems start. All the instructions that were executed speculatively in the pipeline must be discarded: the pipeline is flushed, then restarted from scratch. Going back to the automotive analogy, this is equivalent to throwing away all the cars in progress and waiting another 3 hours for a car to finally exit the assembly line. A catastrophe in terms of productivity!

So what can developers do about that? Writing this post, I realize it should be broken into parts, because there is so much to say about the impact of branches on performance! For the moment, I am just going to give you pointers to the solutions, namely: branchless programming, and Profile-Guided Optimisation.
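To give a taste of branchless programming, here is a minimal sketch (the function names and threshold are mine, purely for illustration). Both functions count the values above a threshold: the first uses a data-dependent "if" that the branch predictor must guess on every element (and on random data it will often guess wrong), while the second turns the comparison into arithmetic, leaving nothing to mispredict.

```cpp
#include <vector>

// Branchy version: the hardware must predict this data-dependent "if"
// on every element; unpredictable data defeats the branch predictor.
int count_big_branchy(const std::vector<int>& v) {
    int n = 0;
    for (int x : v)
        if (x >= 128) ++n;
    return n;
}

// Branchless version: the comparison result converts to 0 or 1 and is
// added directly, so there is no conditional branch to mispredict.
int count_big_branchless(const std::vector<int>& v) {
    int n = 0;
    for (int x : v)
        n += (x >= 128);   // bool promotes to 0 or 1
    return n;
}
```

Note that an optimising compiler may already emit a conditional move for the branchy version; as always, check the generated assembly and measure before concluding anything.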

Inferring performance by counting instructions vs. reality


In a previous post, I described how to extract the assembly code from C++ using GCC. One might be tempted to simply count the instructions and deduce the performance of the code from that. This approach is totally oblivious to what really happens when the code is executed. In particular, it does not take the memory model into account.

Now let me bring back some memories (pun intended!). Moore's famous 1965 law predicts that the number of transistors on a chip doubles every couple of years. Where are those transistors? On the CPU! So CPUs have kept increasing in performance at an exponential rate for decades. But a CPU without data to work on is actually pretty much useless... In fact, both the instructions to execute and the data to work on are stored in memory cells in the first place. And guess what: the speedup of memory cells has been notoriously lower than that of CPUs. Orders of magnitude lower, in fact! According to this excellent article, the performance of CPUs increased by a factor of 10,000 between 1980 and 2010. Do you know how much memory cells sped up in the same time frame? ... Less than 10! So the risk on modern machines is having the CPU wait for data from memory (both instructions and the working set), running at nearly 0% of its capabilities.

So how did engineers make up for this gap? The answer is multi-level memory architectures. It all relies on the fact that tiny memory cells can be quicker, but are more expensive; I am talking about sizes of a few dozen kilobytes. Bigger memories, of megabyte sizes, are slower but cheaper. To get the best of both worlds, multi-level architectures implement complex mechanisms that turn the tiny, fast memories into substitutes for the bigger ones, just in time. Ideally, the CPU doesn't even notice that there are other, slower memories around it, and believes that only the fastest memory level exists. How do multi-level architectures work?
Well, this is a deep topic and I might write a dedicated post about it later. For the moment, the best I can do is direct the interested reader to this article. Returning to the point of this post: the number of instructions executed by a program is a very poor indicator of performance, because the bottlenecks mostly lie in retrieving data. So unless two programs have a HUGE difference in instruction count, it should not be taken into account. And as always: performance is not inferred. It is measured.
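Here is a minimal sketch of that last point (function names and sizes are mine, for illustration only). The two traversals below execute essentially the same number of instructions, yet the row-major one walks memory contiguously while the column-major one jumps by a whole row at each step and defeats the caches. On a matrix much larger than the cache, timing them reveals a gap that instruction counting never would.

```cpp
#include <chrono>
#include <vector>

// Sums an n x n matrix stored row-major, walking memory contiguously:
// each access touches the cache line loaded by the previous one.
long long sum_row_major(const std::vector<int>& m, int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            s += m[i * n + j];   // contiguous, cache-friendly
    return s;
}

// Same arithmetic, same instruction count, but strided accesses that
// jump n elements at a time and keep missing the cache for large n.
long long sum_col_major(const std::vector<int>& m, int n) {
    long long s = 0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            s += m[i * n + j];   // strided, cache-hostile
    return s;
}

// Minimal timing harness: measure, don't infer.
template <class F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

With optimisations enabled, the compiler may reorder or vectorise these loops, and the gap only shows up once the matrix outgrows the caches. So, in the spirit of this post: run it on your own machine and measure, rather than infer.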



Copyright © 2022-2026
Created by Janahan Nallanathan