
Experimental phasor extrapolation for Optimized CPU gridder kernel

Bram Veenboer requested to merge update-cpu-optimized into master

This branch contains a number of changes:

  • Resolve CMake warning message with CMake 3.18.4 (I haven't seen these with an older version, e.g. 3.16.2)
  • Cleanup of include directories for CPU Optimized kernels
  • Undo some 'optimizations' for CPU Optimized gridder kernel (this actually improves performance!)
  • Cleanup of Sine/Cosine lookup table code, only one (the fastest) option is left
  • Minor cleanup of VML related code
  • Add USE_PHASOR_EXTRAPOLATION option to CMake, to enable experimental phasor extrapolation in CPU Optimized gridder kernel
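For readers unfamiliar with the idea behind the last item, here is a minimal sketch of phasor extrapolation. It is not the code in this branch. It assumes equally spaced channels (wavenumbers) and a phase of the form `phase_offset - phase_index * wavenumber`; the function names `phasors_direct`, `phasors_extrapolated` and `max_abs_error` are hypothetical. Instead of one sine/cosine pair per channel, only the first phasor and a constant per-channel rotation are evaluated, and every further phasor follows from a complex multiplication.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Phasors = std::vector<std::complex<float>>;

// Reference: evaluate one sine/cosine pair for every channel.
Phasors phasors_direct(float phase_index, float phase_offset,
                       const std::vector<float>& wavenumbers) {
  Phasors out(wavenumbers.size());
  for (std::size_t c = 0; c < wavenumbers.size(); ++c) {
    // Assumed phase convention, for illustration only.
    const float phase = phase_offset - phase_index * wavenumbers[c];
    out[c] = std::complex<float>(std::cos(phase), std::sin(phase));
  }
  return out;
}

// Extrapolation: assumes the wavenumbers are equally spaced, so the phase
// advances by the same amount from one channel to the next.
Phasors phasors_extrapolated(float phase_index, float phase_offset,
                             const std::vector<float>& wavenumbers) {
  const std::size_t n = wavenumbers.size();
  Phasors out(n);
  if (n == 0) return out;

  // Sine/cosine is evaluated only for the first channel ...
  const float phase0 = phase_offset - phase_index * wavenumbers[0];
  out[0] = std::complex<float>(std::cos(phase0), std::sin(phase0));
  if (n == 1) return out;

  // ... and for the constant rotation between adjacent channels.
  const float dphase = -phase_index * (wavenumbers[1] - wavenumbers[0]);
  const std::complex<float> delta(std::cos(dphase), std::sin(dphase));

  // All remaining phasors follow from one complex multiplication each.
  // Rounding errors accumulate along this chain.
  for (std::size_t c = 1; c < n; ++c) {
    out[c] = out[c - 1] * delta;
  }
  return out;
}

// Largest absolute difference between two sets of phasors.
float max_abs_error(const Phasors& a, const Phasors& b) {
  assert(a.size() == b.size());
  float err = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    err = std::max(err, std::abs(a[i] - b[i]));
  }
  return err;
}
```

Because each phasor is derived from the previous one, the rounding error of this scheme grows with the length of the chain, that is, with the number of channels.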

First of all, I made sure that there is no performance regression on DAS-5. This was tested on node504 with gcc/8.3.0 and Intel MKL from intel/2020.1. The performance of the CPU Optimized gridder kernel improved from 287 GFlop/s to 305 GFlop/s without changing any other (CMake) settings. After enabling phasor extrapolation (-DUSE_PHASOR_EXTRAPOLATION=True), performance increased to 380 GFlop/s. Relative to the original 287 GFlop/s, these are performance gains of 6% and 32%, respectively.

For most of the tests, I used a different system with an AMD Ryzen 5 3600 CPU, GCC 10.2.1 and libc 2.32. Note that this libc version is much newer than the one on DAS-5 (2.17) and has a much-improved Sine/Cosine implementation. Some results:

  • Default: 145 GFlop/s
  • MKL: 75 GFlop/s (From Intel 2020.2. Intel limits the MKL performance on non-Intel hardware)
  • MKL + hack: 175 GFlop/s (Trick MKL into believing it is dealing with an Intel CPU, see.)
  • Lookup table: 125 GFlop/s
  • Phasor extrapolation: 240 GFlop/s (+65% over default, +37% over MKL+hack)
  • Phasor extrapolation + MKL + hack: 222 GFlop/s (unknown why this is slower than without MKL)

For comparison, this is the best I got with the Intel (2020.2) compiler:

  • Intel compiler + MKL + hack + extrapolate: 193 GFlop/s (+33% over default)

I tested correctness with test-cpu-optimized.x, which compares the results of the CPU Optimized gridder kernels against those of the CPU Reference gridder kernel.

  • Default: r_error: 0.030066, i_error: 0.001318
  • Lookup table: r_error: 0.050847, i_error: 0.001660 (can be reduced by increasing size of the lookup table)
  • Phasor extrapolation: r_error: 0.039210, i_error: 0.001452 (depends on the number of channels, this test has 9 channels)
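To illustrate why the lookup table error shrinks with table size, here is a minimal sketch of a nearest-entry sine/cosine table. It is not the lookup table implementation kept in this branch. The class `SineTable` and the helper `max_table_error` are hypothetical names, and nearest-entry rounding is an assumption chosen for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical nearest-entry lookup table covering one period of the sine.
class SineTable {
 public:
  explicit SineTable(std::size_t size) : table_(size) {
    assert(size > 0);
    for (std::size_t i = 0; i < size; ++i) {
      const double angle = kTwoPi * static_cast<double>(i) /
                           static_cast<double>(size);
      table_[i] = static_cast<float>(std::sin(angle));
    }
  }

  float sin(float x) const { return table_[index(x)]; }

  // cos(x) = sin(x + pi/2), so one table serves both functions.
  float cos(float x) const {
    return table_[index(x + static_cast<float>(kTwoPi / 4.0))];
  }

 private:
  static constexpr double kTwoPi = 6.283185307179586;

  // Map an angle in radians to the nearest table entry.
  std::size_t index(float x) const {
    float turns = x / static_cast<float>(kTwoPi);
    turns -= std::floor(turns);  // fractional part, in [0, 1]
    const std::size_t i = static_cast<std::size_t>(
        turns * static_cast<float>(table_.size()) + 0.5f);
    return i % table_.size();  // wrap around when rounding up to size
  }

  std::vector<float> table_;
};

// Largest deviation from std::sin over [-2*pi, 2*pi], sampled uniformly.
float max_table_error(const SineTable& table, std::size_t samples) {
  const float lo = -6.2831853f;
  const float hi = 6.2831853f;
  float err = 0.0f;
  for (std::size_t s = 0; s < samples; ++s) {
    const float x = lo + (hi - lo) * static_cast<float>(s) /
                             static_cast<float>(samples);
    err = std::max(err, std::abs(table.sin(x) - std::sin(x)));
  }
  return err;
}
```

With nearest-entry rounding, the angle is off by at most half a table step, so the error is bounded by roughly pi divided by the table size. Quadrupling the table size therefore cuts the worst-case error by about a factor of four, at the cost of a larger memory footprint.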

To summarize: the CPU Optimized kernel code has been cleaned up a little, and a new optional phasor extrapolation implementation was added to the CPU Optimized gridder kernel. This implementation performs considerably better (about 30% over the typical 'best-case' setting), while the accuracy seems to be rather good: in test-cpu-optimized.x, its error lies between that of the default and the lookup table variants.
