Performance of conditional operator vs. fabs
Today, we will take a look at potential performance problems when using the conditional operator ?:.
Specifically, we will use it to calculate a variable’s absolute value and compare its performance with that of the function fabs.
Assume the following numerical code written in C, where we need to calculate the absolute value of a double variable called residuum.[1]
Since we want to perform this operation within the inner loop, we will have to keep performance overhead as low as possible.
To reduce dependencies on math libraries and avoid function call overhead, we compute the absolute value manually by first checking whether residuum is less than 0 and, if it is, negating it using the - operator.
This looks easy enough and, in theory, should provide satisfactory performance.
Just to be sure, let’s do the same using the fabs function from the math library, which returns the absolute value of a floating-point number.
Let’s compare the two implementations using hyperfine.[2]
As we can see, the fabs implementation ran faster by more than a factor of 1.9!
Where does this massive performance difference come from?
Let’s use perf stat to analyze the two implementations in a bit more detail.
The most important metrics here are the number of instructions and the number of cycles. For the conditional operator implementation, the roughly 4,000,000,000 instructions at 1.97 instructions per cycle take roughly 2,000,000,000 cycles. Our processor runs at around 4,250,000,000 cycles per second, which results in a runtime of 0.48 seconds.
The reduction from roughly 2,000,000,000 to roughly 1,000,000,000 cycles corresponds to the performance improvement by a factor of 1.95.
Using the fabs function reduced the number of instructions by roughly 25% and, at the same time, increased the number of instructions per cycle to 2.89 (a factor of 1.47).
Getting rid of the conditional operator reduced the number of branches by half, allowing the processor to process more instructions per cycle.
The conditional operator is more or less a short-hand version of the if statement and introduced a significant number of branches into our inner loop.
Running three nested loops with 1,000 iterations each resulted in 1,000,000,000 inner loop iterations. The reduction by roughly 1,000,000,000 instructions therefore means that we saved one instruction per inner loop iteration.
These branch and instruction differences can be checked in even more detail using objdump -S; this is left as an exercise for the reader.
The magnitude of these performance differences is rather surprising and shows that it makes sense to check even seemingly simple code for potential performance problems.
[1] The code shown is only an excerpt; the full code is available here. It was compiled with GCC 11.2 using the -O2 -Wall -Wextra -Wpedantic flags and linked against the math library using -lm.
[2] hyperfine performs a statistical performance analysis. It runs the provided commands multiple times to reduce the influence of random errors and calculates derived metrics such as the mean and standard deviation.