Add phasor extrapolation to CUDA gridder and degridder kernels + correctness tests

Bram Veenboer requested to merge cuda-phasor-extrapolation into master

The gridder and degridder kernels rely heavily on sine/cosine evaluations to compute the phasor. On NVIDIA GPUs, this has not impacted performance significantly so far, since the ratio of 17 FMA operations for every sine/cosine evaluation in these kernels matches the hardware: up to and including Turing, every SM has 1 special function unit (SFU) for every 16 FP32 units. On Ampere (GA102/GA104, not GA100), there are fewer SFUs relative to the number of FP32 units. Moreover, if we start using FP16 or even Tensor Cores for some operations in these kernels, the SFUs will become a bottleneck.

We previously evaluated phasor extrapolation as a potential workaround for the CPU kernels: !32 (merged)
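For readers unfamiliar with the technique, the idea can be illustrated with a minimal sketch. It assumes the phase advances by a constant step between consecutive evaluations (for example across equally spaced channels), so that each subsequent phasor follows from the previous one by a single complex multiplication instead of a new sine/cosine evaluation. The function names and the parameters `phase0`, `dphase` and `n` are hypothetical and chosen for illustration only; this is not the kernel code from this MR.

```python
import cmath


def phasors_direct(phase0, dphase, n):
    """Reference: one complex exponential (sine/cosine pair) per phasor."""
    return [cmath.exp(1j * (phase0 + k * dphase)) for k in range(n)]


def phasors_extrapolated(phase0, dphase, n):
    """Extrapolation: two complex exponentials in total, then only multiplications.

    Assumes the phase increases by the constant step `dphase` (an assumption
    of this sketch, not a statement about the actual kernels).
    """
    phasor = cmath.exp(1j * phase0)  # phasor for the first evaluation
    delta = cmath.exp(1j * dphase)   # rotation applied for every step
    result = []
    for _ in range(n):
        result.append(phasor)
        phasor *= delta              # complex multiply replaces sine/cosine
    return result


# Compare both variants for 16 steps (hypothetical values).
direct = phasors_direct(0.3, 0.01, 16)
extrapolated = phasors_extrapolated(0.3, 0.01, 16)
max_err = max(abs(a - b) for a, b in zip(direct, extrapolated))
print(f"max abs difference: {max_err:.3e}")
```

The repeated multiplication accumulates rounding error, so the deviation from the direct evaluation grows with the number of extrapolated steps. This is why the accuracy is verified against a reference.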

In this MR, we apply the same methodology to the CUDA gridder and degridder kernels. Previously, we only tested correctness at the routine level (e.g. gridding and degridding). To get a better understanding of the implications of phasor extrapolation (and, in the future, of other reduced-precision optimizations), dedicated tests are added at the gridder and degridder kernel level. These tests use the recently introduced CUDA Python bindings. Simplified reference kernels are added, to which the optimized kernels are compared, both without and with phasor extrapolation.

The same error metric is used as in the C++ tests, with the following results:

Gridder:

  • Default: (2.44e-09-3.92e-11j)
  • With extrapolation: (3.09e-08-4.97e-10j)

Degridder:

  • Default: (1.21e-12+2.301e-16j)
  • With extrapolation: (1.49e-12+2.75e-15j)

While these results are by no means complete (ideally, many more parameters should be tested), they illustrate that the loss in accuracy from using extrapolation is negligible (smaller than floating-point precision).
