What is a Benchmark?
Benchmarks are how we measure performance, obviously. They’re programs we run to tell us how fast our code is.
This intuitive definition is what most of us would offer if asked the title question. But I want to pose a simple follow-up: if benchmarks tell us how fast something is, what does that make production performance metrics? Try to answer that without conceding that benchmarks are doing something other than “telling us how fast a system is”. And if they’re not doing that, what are they doing?
Benchmarks are a hypothesis. And I mean that in the earnest sixth-grade science class sense of hypothesis: a falsifiable statement about the world, supported by existing evidence, which can be used to make future predictions.
Benchmarks as Hypotheses
Benchmarks express the hypothesis that changes in their execution performance will reflect changes in the real-world performance of our system.
For a benchmark to be maximally useful, we need it to work as a hypothesis in several specific ways:
Direction: If the benchmark improves, that should reliably predict that real-world performance will improve. And if the benchmark regresses, we should expect real-world performance to regress.
Magnitude: We also want our benchmark to be useful in predicting the size of performance improvements. At the very least, bigger changes in benchmark results should imply bigger expected changes in production performance. Importantly, we don’t really expect most benchmark results to precisely match production numbers, because most benchmarks are a distillation of a portion of an end-to-end process, and Amdahl’s law tells us that there is a limit to the benefit of speeding up only a portion of a process (a worked example follows this list).
Negative results: Last, but certainly not least, we want our benchmark to express no (statistically significant) change in results if and only if production performance will not change. This, notably, gives us the property that no regression in the benchmark implies we can expect no regression in production.
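To make the Amdahl’s law point concrete, suppose a benchmark isolates a component that accounts for a fraction p of end-to-end runtime, and we speed that component up by a factor of s. The overall speedup is bounded by 1 / ((1 − p) + p/s). With made-up numbers: if the component is 20% of end-to-end time and the benchmark shows a 2x improvement, production only improves by 1 / (0.8 + 0.2/2) = 1 / 0.9 ≈ 1.11, roughly an 11% improvement, a much smaller magnitude than the benchmark alone suggests.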
Alas, one thing that’s immediately clear is that for many systems there’s no single benchmark that could capture all production performance dynamics. To take a simple example, the introduction of caching speeds up reads at the expense of extra work in the write path. Users with read-heavy workloads should expect different performance impacts than users with write-only workloads.
As a result, large systems generally need suites of benchmarks rather than any single benchmark. The fact that they will sometimes point in different directions is frustrating, but it is also an unavoidable reality of a complex system.
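As a sketch of what such a suite might look like, here are two Go benchmarks written against a toy in-memory store. The store type, the key space, and the workloads are all invented for illustration; they stand in for whatever real cache-backed system you are measuring.

```go
// store_test.go: a minimal, illustrative benchmark pair; the "store" type
// below is a toy stand-in for a real cache-backed system.
package store

import (
	"sync"
	"testing"
)

type store struct {
	mu   sync.Mutex
	data map[string]string
}

func newStore() *store {
	return &store{data: make(map[string]string)}
}

func (s *store) Get(k string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[k]
}

func (s *store) Put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[k] = v
}

// BenchmarkReadHeavy models a read-dominated workload, where a cache in
// front of the store should help.
func BenchmarkReadHeavy(b *testing.B) {
	s := newStore()
	s.Put("key", "value")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		s.Get("key")
	}
}

// BenchmarkWriteOnly models a write-only workload, where any cache
// maintenance on the write path is pure overhead.
func BenchmarkWriteOnly(b *testing.B) {
	s := newStore()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		s.Put("key", "value")
	}
}
```

Run with `go test -bench .`; a caching change would be expected to move these two benchmarks in opposite directions, which is exactly the tension a single benchmark would hide.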
This framework of benchmark-as-hypothesis is helpful because it gives us a vocabulary for describing the limitations of a benchmark. For example, micro-benchmarks can often (though not always) capture the directionality of performance changes in a real system, but they are rarely useful for predicting the magnitude of changes in a larger system. While we can obviously make basic observations about the limits of micro-benchmarks without a benchmark-as-hypothesis framing, the framing helps us be more precise and analyze more complex circumstances.
Conclusion
Performance is a perfect manifestation of two adages: Peter Drucker’s “what gets measured gets managed” and Charles Goodhart’s “when a measure becomes a target, it ceases to be a good measure”. Benchmarks are an essential part of performance engineering, allowing us to measure and manage improvements to performance. However, it’s always possible to construct benchmarks and optimizations where the benchmark improves but real-world performance does not – and not only is it possible, we very frequently do construct such benchmarks, fooling ourselves into false beliefs about our programs' performance. We must never lose sight of the idea that the purpose of a benchmark is ultimately to predict real-world performance, and if it can’t do so reliably, it has limited value.
We can also extend this type of thinking to other features of the software engineering process. Unit tests are a hypothesis about correctness. Penetration tests are a hypothesis about security. But I’ll leave those for another day, or perhaps as an exercise for the reader.