Modern CPUs: an introduction

Friday May 1, 2026 performance

In this series, I propose to explore the specific features of modern CPU architectures, and their consequences for writing high-performance code.

Disclaimer: If you are already familiar with these topics, you probably won’t learn much. If you’re not—or just in the mood to cope with my ramblings—well, you are welcome to read on!

How things used to be

I started coding when I was a kid on a Thomson TO8D, a French-designed home computer. These machines had limited computing power, even by the standards of the time: a Motorola 6809 8-bit processor clocked at 1 MHz, and a whopping 256 KiB of RAM. Some commercial programs could be plugged in directly as ROM cartridges (much like on gaming consoles); everything else had to be loaded from 3.5" floppy disks. As far as I remember, these used the so-called “double density” format (DD), allowing 720 KiB of storage. By contrast, my family’s later acquisition, an IBM-compatible PC, would handle the “high density” format (HD), which doubled disk capacity to 1440 KiB (the famous “1.44 MB”).

Of course, I was just a kid, so performance was the least of my concerns. I was happy enough to dabble in Microsoft BASIC, which ran through an interpreter at a painfully slow pace. But for those who did have to worry about performance back then, things were rather simple: memory access was uniform (when the CPU only has 256 KiB to choose from, it doesn’t really need a cache), and if you coded in assembly language, you could literally predict how fast your program would run by counting the cycles used by each instruction. Of course, I’m oversimplifying: even then, there must have been some subtleties like register usage and such. But you get the idea: the picture was rather uncomplicated.

How things are now

Fast-forward forty years, and the situation has changed drastically. CPUs have grown vastly in complexity. They have gained specialized execution units (e.g. for floating-point operations). Vector instruction sets have been added, each generation bringing fancier capabilities and operating on wider data (e.g. SSE, SSE2, AVX, AVX2, AVX-512 on x86 architectures, or Neon on ARM). But, from my point of view, the two most impactful characteristics of modern architectures are pipelining and caching. Let’s dive into them.

Pipelining

Modern CPUs are called “superscalar”, meaning they can execute more than one instruction per cycle, in contrast with so-called “scalar” processors that execute at most one instruction per cycle (and even fewer if instructions take multiple cycles). The term “superscalar” may actually imply a bunch of different things, but for practical purposes, the most important is that all modern mainstream processors support some form of pipelining: they can start executing an instruction without waiting for the previous one to “retire” (i.e. finish executing).
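To get a feel for why pipelining matters, here is a back-of-the-envelope model, sketched in C. It assumes an idealized pipeline, which is a textbook simplification and not a description of any real CPU: every instruction goes through the same number of stages, each stage takes one cycle, and nothing ever stalls. The function names are made up for the illustration.

```c
#include <stdint.h>

/* Cycles needed when each instruction must finish before the next starts. */
uint64_t cycles_unpipelined(uint64_t instructions, uint64_t stages) {
    return instructions * stages;
}

/* Cycles needed on an idealized pipeline: the first instruction takes
 * `stages` cycles to go through, then one more instruction completes
 * on every following cycle. */
uint64_t cycles_pipelined(uint64_t instructions, uint64_t stages) {
    if (instructions == 0) {
        return 0;
    }
    return stages + (instructions - 1);
}
```

With 1000 instructions and 5 stages, this model gives 5000 cycles without pipelining versus 1004 with it. That is close to a fivefold speedup, but only as long as the pipeline never stalls.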

It turns out that this has a big impact on performance-oriented coding. Basically, if you want your processor to run at full speed, you need to keep its instruction pipeline fed. Unfortunately, that’s easier said than done, as obstacles abound…

The first is instruction dependency: if the execution of one instruction depends on the outcome of a previous instruction, then the CPU must wait until the earlier instruction has produced its result before executing the later one, effectively stalling the pipeline.

A common form of instruction dependency is a data dependency, where the result of an instruction depends on the result of a previous instruction.
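As an illustration, here is a minimal C sketch of a chain of data dependencies. Polynomial evaluation by Horner's method is used purely as an example, and the function name is hypothetical. Each iteration needs the value of `acc` computed by the previous one, so the multiply-adds cannot overlap, no matter how many execution units the CPU has.

```c
#include <stddef.h>

/* Evaluates coeffs[0]*x^(n-1) + coeffs[1]*x^(n-2) + ... + coeffs[n-1]. */
double horner(const double *coeffs, size_t n, double x) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Data dependency: this statement reads the `acc` written by the
         * previous iteration, so iterations execute one after another. */
        acc = acc * x + coeffs[i];
    }
    return acc;
}
```

The loop counter `i` is also updated on each iteration, but that chain is short and cheap. The chain through `acc` is the one that bounds the speed of the loop.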

Another form is control dependency, typically created by branches: each conditional jump may steer the execution towards one of two alternatives. If the CPU wants to keep running before retiring the condition (and therefore before knowing its result), it needs to take a guess: this is called branch prediction. Then it can tentatively execute the chosen branch: this is called speculative execution. When the CPU guesses right, all is well: the pipeline can keep running unhindered. But of course, it may guess wrong, in which case it needs to throw away all the work it has done since the conditional jump and start over, this time executing the correct branch. From a performance perspective, this is functionally equivalent to stalling the pipeline at the jump. (Unfortunately, it is not completely equivalent, as speculative execution may have side effects. This is how you get side-channel attacks, including the infamous Spectre and Meltdown.)

As a consequence, in high-performance coding, making branches as predictable as possible, or even better, avoiding them altogether, is of the utmost importance.
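To make this concrete, here is a small C sketch that counts the elements of an array exceeding a threshold, first with a branch, then without. The function names are hypothetical, and this is only one way of removing a branch among many. Note also that an optimizing compiler may well turn the first version into the second by itself, so the generated assembly is the only real judge.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: one conditional jump per element. On random data,
 * the branch predictor will often guess wrong. */
size_t count_above_branchy(const int32_t *data, size_t n, int32_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* Branchless version: in C, a comparison evaluates to 0 or 1, so its
 * result can be added directly instead of steering a jump. */
size_t count_above_branchless(const int32_t *data, size_t n, int32_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(data[i] > threshold);
    }
    return count;
}
```

The loop exit test is still a branch in both versions, but it is taken on every iteration except the last, which makes it about as predictable as a branch can be.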

Caching

Another important characteristic of modern CPUs is that they have huge amounts of RAM to work with. (As of 2026, it’s on the order of ten gigabytes for laptops or small VMs, hundreds of gigabytes or even terabytes for big servers.) Accessing main memory is slow relative to the speed of the processor. That’s why there are now multiple layers of cache sitting between the processor and main memory. A typical CPU has three layers of cache (dubbed L1, L2 and L3), each larger but also slower than the previous one. (In practice, some layers, typically L1, may be split between data and instructions.) To complicate things further, modern CPUs have multiple cores, each functioning more or less as an independent processor with respect to executing instructions… but all sharing the main memory. Caches can be local to each core (especially for the lower levels) or shared between cores (as is typical for higher levels).

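The practical implication is that the cost of accessing data in memory depends on where the data currently resides, and on which core is accessing it. In other words, memory access is no longer uniform. (Strictly speaking, the term NUMA, for Non-Uniform Memory Access, designates systems, typically multi-socket servers, in which each processor reaches its own local memory faster than the memory attached to other processors. But the cache hierarchy creates a similar non-uniformity on any machine.)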

As a consequence, to keep your code as fast as possible, data locality becomes an important consideration: processing data that is already in (one of) the cache(s) is faster than processing data stored at some random location in memory.
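A classic way to observe the effect of locality is to sum the elements of a matrix in two different orders. The C sketch below assumes the matrix is stored row by row in one flat array (the layout and the function names are choices made for the illustration). Both functions compute the same sum, but the first walks through memory sequentially, while the second jumps `cols` elements ahead at each step.

```c
#include <stddef.h>

/* Row by row: consecutive iterations touch adjacent addresses, so each
 * cache line fetched from memory is fully used before moving on. */
double sum_row_major(const double *m, size_t rows, size_t cols) {
    double total = 0.0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += m[r * cols + c];
        }
    }
    return total;
}

/* Column by column: consecutive iterations are `cols` elements apart,
 * so on a large matrix almost every access lands in a different cache
 * line. */
double sum_col_major(const double *m, size_t rows, size_t cols) {
    double total = 0.0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

On a matrix small enough to fit entirely in cache, the two versions should perform about the same. The gap is expected to show only once the data outgrows the cache.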

This problem also compounds with the previous characteristic (pipelining) to form additional data dependencies: if instructions need to wait for memory, then it doesn’t matter much that they don’t depend on one another; at some point, the pipeline will stall all the same.
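The textbook case of such a memory-induced dependency is pointer chasing. Here is a minimal C sketch contrasting a linked list with a plain array. The `struct node` type and the function names are hypothetical, introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    int64_t value;
    struct node *next;
};

/* Linked list: the address of the next node is only known once the
 * current node has been loaded, so cache misses are paid one after
 * another. */
int64_t sum_list(const struct node *head) {
    int64_t total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        total += p->value;
    }
    return total;
}

/* Array: every address can be computed from `values` and `i` alone, so
 * a load does not have to wait for the previous one to complete. */
int64_t sum_array(const int64_t *values, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}
```

Both loops perform the same additions. The difference lies in how the addresses are obtained: in the list version, each load sits on the critical path of the next one.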

Practical consequences

All right, so we want our code to be as branchless as possible, while working with local data! How do we achieve that in practice? Well, the answer obviously depends on each particular algorithm. Nevertheless, the same patterns occur time and time again.

A frequent solution is micro-batching. The idea is to handle small batches of data at each iteration, instead of a single item. Batching gives you at least three interesting properties:

It amortizes fixed costs (e.g. virtual calls).
It helps break data dependencies by creating multiple independent “lanes” of processing. This is closely related to loop unrolling.
It makes the code amenable to vectorization, i.e. we can leverage special SIMD instructions (like AVX or Neon).
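As a rough sketch of the “lanes” idea, here is a batched sum in C. The batch size of 4 and the names `LANES` and `sum_batched` are arbitrary choices for the illustration, not recommendations. Each lane accumulates into its own variable, so the additions within a batch do not depend on one another.

```c
#include <stddef.h>

#define LANES 4 /* batch size: an arbitrary choice for this sketch */

double sum_batched(const double *data, size_t n) {
    double acc[LANES] = {0.0};
    size_t i = 0;

    /* Main loop: one batch of LANES items per iteration. The LANES
     * additions are independent of each other, so they can overlap. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            acc[lane] += data[i + lane];
        }
    }

    /* Tail: leftover items that do not fill a whole batch. */
    double total = 0.0;
    for (; i < n; i++) {
        total += data[i];
    }

    /* Combine the partial sums of all lanes. */
    for (size_t lane = 0; lane < LANES; lane++) {
        total += acc[lane];
    }
    return total;
}
```

One caveat: floating-point addition is not associative, so this version may round differently from a plain left-to-right sum. That is also why compilers generally refuse to perform this transformation on their own unless explicitly allowed to relax floating-point semantics.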

In the following posts, we will explore each of the above aspects in more detail, along with simple benchmarks to demonstrate their impact.