Data dependency: the scalar aggregation case
In the previous article, we gave an overview of how modern CPU architectures influence code performance. Now we are going to explore a concrete example of data dependency: that’s when the CPU pipeline stalls because one instruction depends on the result of an earlier instruction.
An easy way to showcase such dependencies is with a scalar aggregation. It’s a kind of computation that aggregates an entire dataset to a single value, the canonical example being a sum. Yes, even with such a simple algorithm, effects of modern CPUs are visible!
Methodology🔗
First of all, a few words about our methodology:
The benchmark is written in C++. (It turns out it would work not as well in Rust; not because this language is any less optimized, but because it gives us less control on what optimizations are applied—and this matters here, as we will soon see.)
We give only short, simplified snippets in the article. You may obtain the full source code in our GitHub repository, and try to reproduce the results. (Or invalidate them! As often when performance is involved, your mileage may vary.) In particular, look at agg_scalar.hpp.
We leave the scaffolding and instrumentation to Google Benchmark. All code is single-threaded.
Timings were obtained on an Apple MacBook Pro M5, and the code was compiled with LLVM’s Clang 22.1.2. (Not Apple’s version shipping with the Xcode command-line tools, although in practice they don’t differ by much.)
All right. With that out of the way, let’s dive into the code!
Naïve implementation🔗
Leaving out the boilerplate, the naïve implementation of a sum goes like this:
|
|
Digression: If you look at the full benchmark code, you’ll notice that it is substantially more complex. That’s because we use the same prototype as std::reduce, which is generic with respect to the value type, the collection type (you just need a forward iterator), the initial value and the operation. In other words, it’s the most generic form of a scalar aggregation. But let’s keep our snippets simple!
To benchmark the code, we aggregate batches of 16 Mi random 32-bit integers (Mi = mebi = 1024 x 1024), and measure the throughput, in terms of items per second. This gives us a baseline of 4.4 G/s. Not bad for a CPU clocking between 3 GHz and 4.6 Ghz! That’s close to the maximum CPU frequency, indicating that the processor is able to retire almost one instruction per cycle. ⚡️ But can we do better?
What is our limiting factor here? Clearly, the addition is: before executing one addition, the CPU needs to know the result of the previous iteration. By contrast, the rest of the loop can be pipelined: incrementing iterators, checking the exit condition… Even the branch can be predicted easily, as every iteration except the last jumps back to the beginning of the loop.
In other word, our pipeline-breaker is the data dependency between additions at each iteration of the loop. Can we remove this dependency? 🤔
Manually unrolled implementation🔗
Yes, we can, with loop unrolling (sometimes also called loop unwinding). The idea is to aggregate using multiple “lanes” of data. Instead of reading items one by one and adding to a single result, we read items K by K (where K is the number of lanes), and add them to K separate partial sums. (If the size of the input is not a multiple of K, we drain the remainder using scalar code.) That way, each partial sum is independent of the others, and the CPU is better able to pipeline them. Of course, at the end of the aggregation, we still need to reduce all partial sums into the final result, but that’s just one tiny scalar loop (same as for draining the input).
The code now looks like this (again, leaving out details for simplicity):
|
|
A tad more complex, undeniably. 😅 But how does it behave? Well, that depends on the number of lanes. Considering that the naïve version is conceptually equivalent to having one single lane, we get:
| Lane count | Throughput (G/s) | Speedup ratio |
|---|---|---|
| 1 | 4.4 | (baseline) |
| 2 | 7.6 | 1.7 |
| 4 | 12.0 | 2.7 |
| 8 | 14.9 | 3.4 |
| 16 | 17.8 | 4.0 |
| 32 | 10.9 | 2.5 |
Holy sheep! 😮 With 2 lanes, we already improve throughput by 70%. With 4 and 8 lanes, we triple it, and quadruple it with 16 lanes.
Note that the performance gain is lower than the number of lanes, which is understandable: each iteration of the loop does other things that just additions; we need room for the iteration boilerplate (increment, check, jump). Also, even when instructions overlap, you still have wasted cycles at the beginning and the end of the unbroken pipeline section, where instructions must execute alone.
Furthermore, you may have noticed that performance plummets with 32 lanes (although never below the baseline). At this point, we have exceeded the length of the CPU pipeline, and maintaining parallel lanes simply becomes counter-productive.
Vectorized implementation🔗
How does our implementation compare to C++’s built-in std::reduce function? Well, let’s see: this one peeks at 20.1 G/s, which is 13% faster than our fastest implementation. 🤯 How did they pull that trick?
While our unrolled code parallelizes instructions to some extent, they still are separate instructions, and are “staggered” in the pipeline. As explainer earlier, this wastes some cycles at the two extremities. In other words, our code is “superscalar”, i.e. fine-tuned to saturate the CPU pipeline, but still made of sequential, scalar instructions.
However, modern CPUs have SIMD instructions (Single Instruction, Multiple Data) that can effectively process multiple lanes at the same time (and not just in close succession, as in pipelining). To leverage these instructions, you have multiple options:
-
Use architecture-specific “intrinsics”, i.e. special functions that translate directly into the corresponding assembly instruction (see Intel’s for example)
-
Leverage more portable libraries, like Google Highway in C++, or the simd module in Rust
-
Rely on the compiler to auto-vectorize the code
It turns out that recent compilers are rather proficient at auto-vectorization. Our simple example is a piece of cake for them! 🍰 This is what happens to std::reduce: the scalar aggregation is silently rewritten by the compiler to use multiple lanes and SIMD instructions.
By the way, the moment has come to confess a secret: if we did nothing, our code snippets above would be auto-vectorized by the compiler. To exhibit data dependencies, we had to explicitly neutralize some optimizations. That’s why, if you look at the full code samples, you’ll see preprocessing directives like:
|
|
Without them, Clang is too smart: it detects the potential dependencies and works around them for you! 💪
Conclusion🔗
The morale of the story is that data dependencies can become a bottleneck for high-performance code. In our simple example, removing the dependencies quadruples the throughput.
Another key learning is that, in many simple situations, the compiler may be able to optimize the code for you. However, in more complex situations, that may not be the case, so that loop unrolling is a nice trick to keep up your sleeve.