Faster C++

Mathematical truth vs. computing reality


Matrix multiplication is at the core of most computationally expensive software. Huge matrix multiplications notably occur in artificial intelligence and in the Finite Element Method, which is used for a variety of applications such as simulating the deformation of solid structures. Another lesser-known but equally important example is the Google Matrix: it is huge, yet multiplying it quickly with any given list of vectors is essential to Google. This is why matrix multiplication has always been a focus of advances in computing power, first on CPUs and now on GPUs.

In this post, I would like to show an example of how to accelerate some matrix multiplications, in the case where at least three matrices with different shapes are involved. This is known as a "Matrix Chain Multiplication".

Let A, B and C be three matrices, and assume we need to compute A.B.C. Mathematics tells us we could either compute A.B first and then multiply the result by C, which we could write (A.B).C, or compute B.C first and then multiply A by that intermediate result: A.(B.C). The two methods are mathematically equivalent, since matrix multiplication is associative, so it looks like there is no difference in implementing one or the other.

However, it is sometimes worth considering this from the point of view of the computer. For instance, assume the following sizes for the matrices:

  • A has size 1x3
  • B has size 3x2
  • C has size 2x4

Just as a reminder, computing the product of two matrices of respective sizes RxM and MxC requires RxMxC scalar multiplications.

Then, computing (A.B).C would require (1x3x2)+(1x2x4) = 6+8 = 14 multiplications.

The second way of doing it would require (3x2x4)+(1x3x4) = 24+12 = 36 multiplications! More than twice the number of multiplications of the first alternative!

Therefore, I encourage you to really take your time and properly choose the best order of matrix multiplication in your algorithms.

Another benefit of performing fewer multiplications is that it can also improve the numerical stability of your computations, a subject I leave for a later post.

On the irrelevancy of algorithmic complexity (Or maybe not!)


I stumbled upon an article describing B-Trees. Like all balanced trees, they have logarithmic complexity for all basic operations: searching, inserting and deleting. Sounds great! But reading it immediately reminded me of a talk by the C++ expert Scott Meyers, especially his claim that no matter what fancy data structure you come up with, an array will beat it (check out the video at the end of this post).

I wanted to put this to the test by comparing B-trees with arrays on random accesses. This means searching for elements in the data structures just like we would in any real-world application: in a way that is not predictable. In other words, I want to make sure that the compiler does NOT know the access pattern, to prevent any under-the-hood optimisation that would bias the conclusions of the benchmark. To achieve this, the access pattern is read at runtime from external text files. Additionally, the access pattern is tailored so that each element is visited once and only once, which is a simple way to ensure we measure the average case. As a side note, and quite obviously, the best case for the pair of vectors is accessing the first pair of elements, and the worst is accessing the last one. Finally, the retrieved elements are simply summed up, so that memory accesses dominate the execution time.

For the B-Tree implementation, I used the one provided by Google here. For the arrays, I implemented two side-by-side arrays that effectively act as a map: the first vector contains the keys, and the second contains the values. That idea is famously used in the (absolutely amazing) EASTL library. I ran the comparison a hundred times, with 1,000 accesses each. Here are the results:

Data structure     Expected time complexity     Measured duration
B-Tree             O(log n)                     0.79 ms
Pair of arrays     n/2 (on average)             2.2 ms

Oops, the pair of arrays is actually slower than the B-tree! I honestly did not expect that, so I dug deeper into it. Maybe the number of accesses has an impact on the comparison, since 1,000 integer elements won't fit into an L1 cache line. I repeated the test with different numbers of accesses, taking a hundred duration samples each time. Here are the results:

Number of accesses     Data structure     Duration
32                     B-Tree             0.027 ms
32                     Pair of arrays     0.016 ms
64                     B-Tree             0.054 ms
64                     Pair of arrays     0.057 ms
128                    B-Tree             0.099 ms
128                    Pair of arrays     0.19 ms
256                    B-Tree             0.22 ms
256                    Pair of arrays     0.72 ms
1,000                  B-Tree             0.79 ms
1,000                  Pair of arrays     2.2 ms
10,000                 B-Tree             10 ms
10,000                 Pair of arrays     656 ms

The log-log plot is as follows:

Plot of durations against the number of accesses for a B-Tree and a pair of arrays

For smaller numbers of accesses, the pair of vectors performs better than the B-tree. The two data structures break even at 64 random accesses: precisely the size of my L1 cache lines! Beyond that point, the B-tree gains the advantage and keeps it asymptotically.

I was frankly surprised by these results. Computers' performance is definitely very hard to predict. This reinforces the idea that any performance prediction must be put to the test!

Here is the Scott Meyers talk that inspired this post:



Copyright © 2022-2026
Created by Janahan Nallanathan