False sharing: A vicious problem when multi-threading
Multi-threading is a lightweight way to bring parallelism to your code. Under ideal conditions, your computing time is divided by the number of threads you use. In practice, there is always some overhead in spawning threads, plus some post-processing to merge the results. But the real problem with multi-threaded programs is that they are notoriously difficult to implement properly. Data races and false sharing are the main concerns; I will focus on false sharing in this post.

Simply put, false sharing happens when two threads work on different variables (which is actually fine!), but those variables sit on the same cache line (that is the bad part!). Every write by one thread invalidates the cache line for the other, leading to cache miss after cache miss, and a huge performance penalty! False sharing differs from a data race in that the results you get at the end are not corrupted; there is no logical problem with the code per se. The only issue is that your threads could have run way faster with some adjustments, so it is definitely worth checking for! The following code seems perfectly fine, yet it contains a false sharing problem:
#include <iostream>
#include <chrono>
#include <thread>

using namespace std;
using namespace std::chrono;

struct Data
{
    long a;
    long b; // sits on the same L1 cache line as the "a" member
    Data() : a(0l), b(0l) { }
};

const long nb_iter = 100000000l;

void work(long &e)
{
    while(e < nb_iter)
    {
        e++;
    }
}

int main()
{
    const auto tic = high_resolution_clock::now();

    Data d;
    std::thread t1(work, ref(d.a));
    std::thread t2(work, ref(d.b));
    t1.join();
    t2.join();

    const auto toc = high_resolution_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = toc - tic;

    cout << "a = " << d.a << endl;
    cout << "b = " << d.b << endl;
    cout << "Computation time: " << elapsed.count() << "ms" << endl;

    return 0;
}
I have run 100 executions and recorded the timings; they average around 600 ms. Now let us add seemingly useless variables between the data members and see whether it has any effect on performance. My computer has L1 cache lines of 64 bytes, and the long type occupies 8 bytes, so I will insert 7 padding longs to ensure the b member sits on a different L1 cache line than a:
struct Data
{
    long a;
    long padding[7]; // seemingly useless variables
    long b;          // now sits on a different L1 cache line
    Data() : a(0l), padding(), b(0l) { }
};
Running another 100 rounds of executions and compiling the results shows an astonishing outcome: the average computing time drops significantly, to around 200ms!!! If you read my post regarding benchmarking, you know that it is wise to use the Student test to ensure this observation is not by random chance. So I did exactly that using Scipy. The Student test applied to my benchmark data showed that there was only about one chance in 10^250 for the measurements to be pure coincidence... So the effect of the padding variables is indeed real! How in the world is that possible?! Adding useless code speeds things up?! Yes it does! Because the two counters now live on distinct L1 cache lines, one thread's writes no longer invalidate the cache line the other thread is working on. After reading this post, you might be tempted to have another look at your data structures!