Experimental phasor extrapolation for Optimized CPU gridder kernel
This branch contains a number of changes:
- Resolve CMake warning messages with CMake 3.18.4 (I haven't seen these with an older version, e.g. 3.16.2)
- Cleanup of include directories for CPU Optimized kernels
- Undo some 'optimizations' for CPU Optimized gridder kernel (this actually improves performance!)
- Cleanup of Sine/Cosine lookup table code, only one (the fastest) option is left
- Minor cleanup of VML related code
- Add USE_PHASOR_EXTRAPOLATION option to CMake, to enable experimental phasor extrapolation in the CPU Optimized gridder kernel
First of all, I made sure that there is no performance regression on DAS-5. This was tested on node504, with gcc/8.3.0 and Intel MKL from intel/2020.1. The performance of the CPU Optimized gridder kernel improved from 287 GFlop/s to 305 GFlop/s without changing any other (CMake) settings. After enabling phasor extrapolation (-DUSE_PHASOR_EXTRAPOLATION=True), performance increased to 380 GFlop/s. Relative to the original 287 GFlop/s, these are performance gains of 6% and 32%, respectively.
For most of the tests, I used a different system with an AMD Ryzen 5 3600 CPU, GCC 10.2.1 and libc 2.32. Note that this libc version is much newer than the one on DAS-5 (2.17) and has a much-improved Sine/Cosine implementation. Some results:
- Default: 145 GFlop/s
- MKL: 75 GFlop/s (From Intel 2020.2. Intel limits the MKL performance on non-Intel hardware)
- MKL + hack: 175 GFlop/s (Trick MKL into believing it is dealing with an Intel CPU)
- Lookup table: 125 GFlop/s
- Phasor extrapolation: 240 GFlop/s (+65% over default, +37% over MKL+hack)
- Phasor extrapolation + MKL + hack: 222 GFlop/s (unknown why this is slower than without MKL)
For comparison, this is the best I got with the Intel (2020.2) compiler:
- Intel compiler + MKL + hack + extrapolate: 193 GFlop/s (+33% over default)
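For readers unfamiliar with the lookup table variant listed above, here is a minimal sketch of the idea: precompute sine and cosine at evenly spaced phases and replace each evaluation by a table read. This is not the code in this branch. The class name, the nearest-entry rounding and the table sizes are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kTwoPi = 6.283185307179586;

// Hypothetical sine/cosine lookup table covering one full period.
class SineCosineTable {
public:
    explicit SineCosineTable(std::size_t nr_entries)
        : sine_(nr_entries), cosine_(nr_entries),
          scale_(static_cast<float>(nr_entries / kTwoPi)) {
        for (std::size_t i = 0; i < nr_entries; i++) {
            const double phase = i * kTwoPi / nr_entries;
            sine_[i] = static_cast<float>(std::sin(phase));
            cosine_[i] = static_cast<float>(std::cos(phase));
        }
    }

    // Round the phase to the nearest table entry and wrap it into the table.
    void lookup(float phase, float& s, float& c) const {
        const long n = static_cast<long>(sine_.size());
        long index = std::lround(phase * scale_) % n;
        if (index < 0) index += n;
        s = sine_[index];
        c = cosine_[index];
    }

private:
    std::vector<float> sine_;
    std::vector<float> cosine_;
    float scale_;
};

// Largest absolute deviation from std::sin/std::cos over phases in [-10, 10).
float max_table_error(std::size_t nr_entries) {
    const SineCosineTable table(nr_entries);
    float max_error = 0.0f;
    for (int i = 0; i < 10000; i++) {
        const float phase = -10.0f + 20.0f * i / 10000;
        float s, c;
        table.lookup(phase, s, c);
        max_error = std::max({max_error, std::abs(s - std::sin(phase)),
                              std::abs(c - std::cos(phase))});
    }
    return max_error;
}
```

With nearest-entry rounding, the phase error is at most half a table step, so in this sketch the worst-case error shrinks roughly in proportion to the table size.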
I tested correctness with test-cpu-optimized.x, which compares the result of the CPU Optimized gridder kernels versus the CPU Reference gridder kernel.
- Default: r_error: 0.030066, i_error: 0.001318
- Lookup table: r_error: 0.050847, i_error: 0.001660 (can be reduced by increasing size of the lookup table)
- Phasor extrapolation: r_error: 0.039210, i_error: 0.001452 (depends on the number of channels, this test has 9 channels)
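To illustrate why the error of phasor extrapolation can depend on the number of channels, here is a minimal sketch of the general technique, not the kernel code in this branch. It assumes that the phase increases by a constant amount from one channel to the next (equally spaced channels). All function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Phasors = std::vector<std::complex<float>>;

// Direct evaluation: one sine/cosine pair per channel.
Phasors phasors_direct(float phase0, float phase_delta, int nr_channels) {
    Phasors out(nr_channels);
    for (int c = 0; c < nr_channels; c++) {
        const float phase = phase0 + c * phase_delta;
        out[c] = {std::cos(phase), std::sin(phase)};
    }
    return out;
}

// Extrapolation: two sine/cosine pairs in total, then one complex
// multiplication per channel to step from one channel to the next.
Phasors phasors_extrapolated(float phase0, float phase_delta, int nr_channels) {
    Phasors out(nr_channels);
    std::complex<float> phasor(std::cos(phase0), std::sin(phase0));
    const std::complex<float> step(std::cos(phase_delta), std::sin(phase_delta));
    for (int c = 0; c < nr_channels; c++) {
        out[c] = phasor;
        phasor *= step;  // rounding errors accumulate with every step
    }
    return out;
}

// Largest absolute difference between two equally sized phasor arrays.
float max_abs_difference(const Phasors& a, const Phasors& b) {
    float max_diff = 0.0f;
    for (std::size_t i = 0; i < a.size(); i++) {
        max_diff = std::max(max_diff, std::abs(a[i] - b[i]));
    }
    return max_diff;
}
```

In this sketch every channel after the first inherits the rounding error of all previous multiplications, which is one plausible reason for the deviation from direct evaluation to grow with the channel count.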
To summarize: the CPU Optimized kernel code has been cleaned up a little, and a new optional phasor extrapolation implementation was added to the CPU Optimized gridder kernel. This implementation performs considerably better (about 30% over the typical 'best-case' setting), while its accuracy seems to be rather good: in the test above, its error lies between that of the default and the lookup table variants.