Today, we will take a look at potential performance problems when using the conditional operator.
Specifically, we will use it to calculate a variable’s absolute value and compare its performance with that of the fabs function.
Assume a piece of numerical code written in C in which we need to calculate the absolute value of a double variable called residuum.
Since we want to perform this operation within the inner loop, we will have to keep performance overhead as low as possible.
To reduce dependencies on math libraries and avoid function call overhead, we compute the absolute value manually: we check whether residuum is less than 0 and, if it is, negate it using the conditional operator.
This looks easy enough and, in theory, should provide satisfactory performance.
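The manual approach described above can be sketched roughly as follows. This is a minimal illustration, not the original benchmark code: the function name sum_abs_conditional and the array-summing loop are assumptions made only to give the expression a context.

```c
#include <stddef.h>

/* Hypothetical helper: sums the absolute values of x[0..n-1],
 * taking each absolute value with the conditional operator. */
double sum_abs_conditional(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double residuum = x[i];
        /* Negate residuum only if it is less than 0. */
        residuum = (residuum < 0.0) ? -residuum : residuum;
        sum += residuum;
    }
    return sum;
}
```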
Just to be sure, let’s do the same using the fabs function from the math library, which returns the absolute value of a floating-point number.
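A fabs-based counterpart might look like the sketch below. The function name sum_abs_fabs and the surrounding loop are again assumptions for illustration; only the call to fabs from math.h reflects what the text describes.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: sums the absolute values of x[0..n-1],
 * taking each absolute value with fabs from the math library. */
double sum_abs_fabs(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double residuum = fabs(x[i]);
        sum += residuum;
    }
    return sum;
}
```

Depending on the toolchain, linking may require the math library (for example, -lm with GCC).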
As we can see, the fabs implementation ran faster by more than a factor of 1.9!
Where does this massive performance difference come from?
We can use perf stat to analyze the two implementations in a bit more detail.
The most important metrics here are the number of instructions and the number of cycles. Our processor runs at around 4,250,000,000 cycles per second, so the conditional operator version needs about 0.48 seconds to process its roughly 4,000,000,000 instructions at 1.97 instructions per cycle.
The reduction from roughly 2,000,000,000 to roughly 1,000,000,000 cycles corresponds to the performance improvement of about 1.95.
Using the fabs function reduced the number of instructions by roughly 25% and, at the same time, increased the number of instructions per cycle to 2.89 (a factor of 1.47).
Getting rid of the conditional operator reduced the number of branches by half, allowing the processor to process more instructions per cycle.
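The arithmetic behind these figures can be checked with a small sketch. The helper names are hypothetical, and the inputs are the rounded values quoted above; in particular, 3.0e9 instructions for the fabs version is an assumption derived from "roughly 25%" fewer than 4.0e9.

```c
/* Cycles needed to process `instructions` at a given
 * instructions-per-cycle rate. */
double cycles_needed(double instructions, double ipc) {
    return instructions / ipc;
}

/* Runtime in seconds for `cycles` at a clock rate of `hz`
 * cycles per second. */
double runtime_seconds(double cycles, double hz) {
    return cycles / hz;
}

/* Estimated speedup of the fabs version over the conditional
 * operator version, based on the rounded figures from the text. */
double estimated_speedup(void) {
    double cycles_conditional = cycles_needed(4.0e9, 1.97); /* about 2.03e9 */
    double cycles_fabs = cycles_needed(3.0e9, 2.89);        /* about 1.04e9 */
    return cycles_conditional / cycles_fabs;                /* about 1.96 */
}
```

With these rounded inputs, runtime_seconds(cycles_needed(4.0e9, 1.97), 4.25e9) comes out at about 0.48 seconds, which matches the measured runtime.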
The conditional operator is more or less a shorthand version of the if statement and introduced a significant number of branches into our inner loop.
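To make the equivalence concrete, here is a sketch of the same logic written both ways. The function names are hypothetical. Whether a compiler actually emits a branch for either form depends on the compiler, target, and optimization flags, so this illustrates the source-level equivalence only.

```c
/* Absolute value via the conditional operator. */
double abs_conditional(double residuum) {
    return (residuum < 0.0) ? -residuum : residuum;
}

/* The same logic spelled out as an if statement. */
double abs_if(double residuum) {
    if (residuum < 0.0) {
        residuum = -residuum;
    }
    return residuum;
}
```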
Running three nested loops with 1,000 iterations each resulted in 1,000,000,000 inner loop iterations. Since the instruction count dropped by roughly 1,000,000,000, we saved about one instruction per inner loop iteration.
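The iteration count follows directly from the shape of the loop nest. The sketch below is a hypothetical stand-in for the benchmark's loops (the function name and the parameter n are assumptions); it only counts how often the innermost body runs.

```c
/* Counts inner loop iterations of three nested loops
 * with n iterations each, i.e. n * n * n. */
unsigned long long inner_iterations(unsigned int n) {
    unsigned long long count = 0;
    for (unsigned int i = 0; i < n; ++i) {
        for (unsigned int j = 0; j < n; ++j) {
            for (unsigned int k = 0; k < n; ++k) {
                ++count; /* the absolute value would be computed here */
            }
        }
    }
    return count;
}
```

For n = 1,000 this yields 1,000,000,000, so a saving of one instruction per iteration accounts for the drop of roughly 1,000,000,000 instructions.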
These branch and instruction differences can be checked in even more detail using objdump -S; this is left as an exercise for the reader.
The magnitude of these performance differences is rather surprising and shows that it makes sense to check even seemingly simple code for potential performance problems.
hyperfine performs a statistical performance analysis. It runs the provided commands multiple times to reduce the influence of random errors and calculates derived metrics such as the mean and standard deviation.