Data locality: sequential vs random access
Continuing with our series about modern CPUs architecture, we now propose a simple benchmark to showcase the effects of data locality.
Methodology🔗
This time, we will be using Rust, because everyone needs variety in their diet. So let’s grill some crab! 🔥🦀 The full code is available on GitHub; look at benches/compute/data_locality.rs in particular. We use Criterion for instrumentation, and run the benchmark on an Apple MacBook Pro M5. Usual disclaimer: your mileage may vary.
To demonstrate data locality, we compare sequential accesses with random accesses to the same memory locations. We first fill a vector of 32-bit integers with random values; let’s call it the value array, and let be its size. We then add an indirection: instead of accessing the value array directly, we store the indices to access in a second array; let’s call that one the index array. We then loop over the index array, look up the corresponding item in the value array, and accumulate it into a sum. (We don’t care about the result; the sum is just here to make sure that the value is not discarded, therefore preventing the memory read from being optimized away.)
In the first variant of the benchmark (sequential access), the index array contains indices from to in ascending order, so that we access each item of the value array exactly once, in sequential order. In the second variant (random access), the index array also contains indices from to , but it is randomly shuffled. The result is that we also access each item of the value array exactly once, but the order in which we do so differs.
As always, the framework takes care of repeating the code to get meaningful results, so that we do many passes over the entire arrays. We also run the benchmark multiple times with varying array sizes.
Results🔗
A picture being worth a thousand words, here is what the results look like. (Note that each axis uses a logarithmic scale for readability.)
What’s happening here? When the input size is small, both variants behave similarly, because all the data fits in cache and stays there from one pass to the next. However, as the input size increases, it can no longer fit in cache.
For the sequential variant, this doesn’t matter much, because each cache miss forces an entire cache line (typically 64 bytes on Intel architectures, i.e. 16 values in our case) to be loaded. It could even be that the CPU populates the cache in bigger chunks, called cache sectors or cache blocks (the specifics depend on the architecture). Moreover, some CPUs are able to predict memory accesses, just like they predict branches. Sequential access being a common pattern, it is trivial for the CPU to pre-warm the cache with the next line/sector/block before exhausting the current one. Because of all that, cache misses barely make a dent in the performance of sequential accesses.
The same can’t be said of the random variant: as the input size increases, performance plummets, and keeps doing so as the sizes of the various caches (L1, L2…) are exhausted. After a while (not represented on the graph), the throughput would plateau, but at a much lower rate: for 4 MiB of data, the sequential variant crunches through the data at 3.9G items/s, while the random variant only achieves 1.6G items/s; that’s a 2.4x slowdown!
Conclusion🔗
Data locality cannot be neglected when you care about performance. You should favor handling data in batches that can fit in the various caches; the smaller the data, the lower level the cache, and therefore the faster access time you get. Also, there is a special bonus for sequential accesses thanks to the predictability of that pattern.