Experimental phasor extrapolation for Optimized CPU gridder kernel (!32) · Merge requests · ResearchAndDevelopment / idg

Bram Veenboer requested to merge update-cpu-optimized into master Nov 25, 2020

This branch contains a number of changes:

Resolve a CMake warning that appears with CMake 3.18.4 (I haven't seen it with an older version, e.g. 3.16.2)
Cleanup of include directories for CPU Optimized kernels
Undo some 'optimizations' for CPU Optimized gridder kernel (this actually improves performance!)
Cleanup of Sine/Cosine lookup table code; only one option (the fastest) is left
Minor cleanup of VML related code
Add USE_PHASOR_EXTRAPOLATION option to CMake, to enable experimental phasor extrapolation in CPU Optimized gridder kernel

First of all, I made sure that there is no performance regression on DAS-5. This was tested on node504, with gcc/8.3.0 and Intel MKL from intel/2020.1. The performance of the CPU Optimized gridder kernel improved from 287 GFlop/s to 305 GFlop/s without changing any other (CMake) settings. After enabling phasor extrapolation (-DUSE_PHASOR_EXTRAPOLATION=True), performance increased to 380 GFlop/s. Relative to the original 287 GFlop/s, these are performance gains of 6% and 32%, respectively.
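The description does not spell out how the extrapolation works, so the following is only a minimal sketch of the general idea, assuming the phase advances by a constant step from one channel to the next (evenly spaced channels). Instead of evaluating a sine/cosine pair per channel, only two phasors are evaluated and the rest are obtained by repeated complex multiplication. The function names and the phase values are hypothetical and are not taken from the IDG code.

```python
import cmath


def phasors_direct(phase0, dphase, nr_channels):
    """Evaluate exp(i * phase) separately for every channel."""
    return [cmath.exp(1j * (phase0 + c * dphase)) for c in range(nr_channels)]


def phasors_extrapolated(phase0, dphase, nr_channels):
    """Evaluate two phasors, then advance by one complex multiply per channel."""
    phasor = cmath.exp(1j * phase0)  # phasor of the first channel
    delta = cmath.exp(1j * dphase)   # constant rotation between adjacent channels
    result = []
    for _ in range(nr_channels):
        result.append(phasor)
        phasor *= delta
    return result


# Hypothetical input: 9 channels and arbitrary phase values.
direct = phasors_direct(0.3, 0.01, 9)
extrapolated = phasors_extrapolated(0.3, 0.01, 9)

# Largest deviation between the two approaches over all channels.
max_diff = max(abs(a - b) for a, b in zip(direct, extrapolated))
```

Because every multiplication adds a little rounding error, the deviation from the direct evaluation tends to grow with the number of channels that are extrapolated from one starting phasor.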

For most of the tests, I used a different system with an AMD Ryzen 5 3600 CPU, GCC 10.2.1 and libc 2.32. Note that this libc version is much newer than the one on DAS-5 (2.17) and has a much-improved Sine/Cosine implementation. Some results:

Default: 145 GFlop/s
MKL: 75 GFlop/s (From Intel 2020.2. Intel limits the MKL performance on non-Intel hardware)
MKL + hack: 175 GFlop/s (Trick MKL into believing it is dealing with an Intel CPU, see.)
Lookup table: 125 GFlop/s
Phasor extrapolation: 240 GFlop/s (+65% over default, +37% over MKL+hack)
Phasor extrapolation + MKL + hack: 222 GFlop/s (it is unclear why this is slower than without MKL)

For comparison, this is the best I got with the Intel (2020.2) compiler:

Intel compiler + MKL + hack + extrapolate: 193 GFlop/s (+33% over default)
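The percentages quoted for both systems can be recomputed from the throughput figures. The helper below is a hypothetical convenience function, not part of IDG; it only uses the GFlop/s numbers reported above. The quoted percentages match the computed gains with the fractional part dropped.

```python
def gain_percent(new, baseline):
    """Relative performance gain of `new` over `baseline`, in percent."""
    return 100.0 * (new / baseline - 1.0)


# DAS-5 (node504), all relative to the original 287 GFlop/s.
das5_cleanup = gain_percent(305, 287)        # cleanup only
das5_extrapolation = gain_percent(380, 287)  # with phasor extrapolation

# AMD Ryzen 5 3600.
ryzen_vs_default = gain_percent(240, 145)    # extrapolation vs. default
ryzen_vs_mkl_hack = gain_percent(240, 175)   # extrapolation vs. MKL + hack
intel_vs_default = gain_percent(193, 145)    # Intel compiler build vs. default

for label, value in [
    ("DAS-5 cleanup", das5_cleanup),
    ("DAS-5 extrapolation", das5_extrapolation),
    ("Ryzen extrapolation vs default", ryzen_vs_default),
    ("Ryzen extrapolation vs MKL + hack", ryzen_vs_mkl_hack),
    ("Intel compiler vs default", intel_vs_default),
]:
    print(f"{label}: +{value:.1f}%")
```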

I tested correctness with test-cpu-optimized.x, which compares the results of the CPU Optimized gridder kernels against those of the CPU Reference gridder kernel.

Default: r_error: 0.030066, i_error: 0.001318
Lookup table: r_error: 0.050847, i_error: 0.001660 (can be reduced by increasing size of the lookup table)
Phasor extrapolation: r_error: 0.039210, i_error: 0.001452 (depends on the number of channels, this test has 9 channels)
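To illustrate why a larger lookup table reduces the error, here is a minimal sketch of a sine table with nearest-entry lookup. This is a generic construction under my own assumptions (one full period, evenly spaced entries, no interpolation); the actual table layout in the CPU Optimized kernel is not described here and may differ. All names are hypothetical.

```python
import math


def make_sine_table(size):
    """Tabulate sin over one full period at `size` evenly spaced points."""
    return [math.sin(2.0 * math.pi * i / size) for i in range(size)]


def lookup_sin(table, x):
    """Approximate sin(x) by the nearest table entry (no interpolation)."""
    size = len(table)
    index = round(x / (2.0 * math.pi) * size) % size
    return table[index]


def lookup_cos(table, x):
    """cos(x) is the same table read a quarter period further along."""
    return lookup_sin(table, x + 0.5 * math.pi)


def max_error(size, samples=10000):
    """Largest absolute sine error over evenly spaced samples in [-10, 10]."""
    table = make_sine_table(size)
    xs = [-10.0 + 20.0 * k / samples for k in range(samples + 1)]
    return max(abs(lookup_sin(table, x) - math.sin(x)) for x in xs)


error_small = max_error(1024)
error_large = max_error(4096)
```

With nearest-entry lookup, the argument is off by at most half a table step, so the error is bounded by pi divided by the table size and shrinks as the table grows, at the cost of a larger memory footprint.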

To summarize: the CPU Optimized kernel code has been cleaned up a little, and a new optional phasor extrapolation implementation was added to the CPU Optimized gridder kernel. This implementation has superior performance (about 30% over the typical 'best-case' setting), while the accuracy seems to be rather good (r_error of 0.039210 versus 0.030066 for the default).
