The Rocket Sheep

Branch prediction and dynamic dispatch

In this article, we will examine the cost of dynamic dispatch, how branch prediction can alleviate it, and to what extent.

What is dynamic dispatch?🔗

Dynamic dispatch is the process of calling a function when the choice of the called function is done at runtime (hence “dynamic”). The opposite is static dispatch, when the called function is known at compile time. Note that only the choice is dynamic; the function itself already exists. We are not talking about Just-In Time (JIT) code generation.

In C++, dynamic dispatch usually manifests itself as virtual functions. In C, that would be function pointers. In Rust, all functions have static dispatch by default; you need the dyn keyword to force a dynamic dispatch.

One important thing to understand is that the dynamic nature of the process is really a property of the call itself, not just the function: you may end up doing a static dispatch to a virtual function if the type of the target object is known at compile time.

Benchmark🔗

Static dispatch🔗

Let’s first define an abstract operation and two concrete implementations. As always, you may browse the full source code online. Look at src/bench/misc/baseline.cpp.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
struct DummyOp
{
    virtual ~DummyOp() = default;
    virtual uint64_t apply(uint64_t input) const = 0;
};

struct IncOp : public DummyOp
{
    __attribute__((noinline))
    virtual uint64_t apply(uint64_t input) const override { return input + 1; }
};

struct DecOp : public DummyOp
{
    __attribute__((noinline))
    virtual uint64_t apply(uint64_t input) const override { return input - 1; }
};

The noinline attribute may look silly for a virtual function, but remember that even virtual functions may be statically called.

With that in place, let’s establish a baseline for static dispatch: on my laptop, we are able to do 2.30G calls/s. Not bad! That’s roughly half the max CPU frequency.

Dynamic dispatch (random)🔗

How does dynamic dispatch compare to that? And more importantly, what is the impact of branch prediction?

To answer both questions at once, we create an array of 1 Mi operation, pointing randomly to either IncOp or DecOp. The trick is that we vary the ratio of the mix: from 0% (only DecOp) to 100% (only IncOp). At 50%, both operations have equal probability. Here are the results:

Dynamic dispatch (random): throughput

We get a nice U-shaped curve! 😍 When the operation is predictable, we reach 1.87G calls/s. Still not bad, but already significantly less than static dispatch (-19%). 🐢 When the operation becomes less predictable, however, the throughput drops quickly, reaching a lower point of 286M calls/s when (as expected) the two operations have equal probability. That’s roughly 8 times slower than static dispatch! 🐌

Of course, this variation can entirely be attributed to branch mispredictions:

Dynamic dispatch (random): branch misses

Dynamic dispatch (predictable)🔗

In the real world, though, we rarely need to execute purely random operations. Imagine a database or a search engine executing a query, for example: granted, the operations are only known at runtime (that’s the user’s query), but they remain the same for all rows/documents to scan. Therefore, can we produce a more realistic benchmark?

For that, we create a small number of operations with a random pattern, but repeat the same pattern at each iteration. In total, we do the exact same number of operations as before (1 Mi). How does that fare?

Dynamic dispatch (predictable): throughput

When the pattern is small (4 operations), the branch predictor doesn’t have too much trouble guessing: we get 1.52G/s, which is similar to the fully predictable cases from the previous benchmark. As the size of the pattern grows, however, the throughput decreases, indicating that the branch predictor is struggling. With 64 Ki distinct operations, we are down to 285M/s, very close to our all-time low of for fully random inputs.

Again, this intuition is confirmed by performance counters. Up to a pattern of size 1024, the CPU does a good job, nearly avoiding branch mispredictions. But then the pattern becomes too big to keep track of, and mispredictions skyrocket:

Dynamic dispatch (predictable): branch misses

Conclusion🔗

To some extent, the branch predictor in modern CPUs can “de-virtualize” virtual function calls. However, that assumes a relatively predictable and small invocation pattern.

Also, no matter how clever the CPU is, a dynamic dispatch remains more expensive than a static dispatch.