
New Coherent Stokes Kernel

Bram Veenboer requested to merge new-coherent-stokes into main

The CUDA Coherent Stokes kernel was found to be a bottleneck in beamformer pipelines with large time integration factors (e.g. 16384x). A new kernel with a completely different vectorization/parallelization strategy solves this. The kernel is designed for time integration factors of 32, or multiples thereof. Smaller time integration factors are processed using an alternative path in the kernel (with a different strategy). The previous Coherent Stokes kernel used a fairly complex heuristic to find the optimal launch configuration. All of that code has been removed in favour of much simpler host code.
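For readers unfamiliar with the operation, the following is a minimal CPU reference sketch of what a coherent Stokes computation with time integration does. It is not the kernel from this merge request. The names `Stokes` and `coherentStokesReference` are hypothetical, the data layout (one flat array per polarization) is an assumption, and the sketch assumes the textbook definitions I = |X|² + |Y|², Q = |X|² − |Y|², U = 2 Re(X·conj(Y)), V = 2 Im(X·conj(Y)), summed over the integration window. The sign convention for V in the actual kernel may differ.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical container for the four Stokes parameters of one output sample.
struct Stokes {
  float I, Q, U, V;
};

// CPU reference sketch: sums `factor` consecutive input samples of the
// X and Y polarizations into one set of Stokes parameters.
// Assumes the number of samples is a multiple of `factor`.
std::vector<Stokes> coherentStokesReference(
    const std::vector<std::complex<float>>& x,
    const std::vector<std::complex<float>>& y,
    std::size_t factor) {
  assert(factor > 0);
  assert(x.size() == y.size());
  assert(x.size() % factor == 0);

  std::vector<Stokes> out(x.size() / factor, Stokes{0.0f, 0.0f, 0.0f, 0.0f});
  for (std::size_t t = 0; t < x.size(); ++t) {
    const float powerX = std::norm(x[t]);  // |X|^2
    const float powerY = std::norm(y[t]);  // |Y|^2
    const std::complex<float> xy = x[t] * std::conj(y[t]);

    Stokes& s = out[t / factor];  // output sample this input contributes to
    s.I += powerX + powerY;
    s.Q += powerX - powerY;
    s.U += 2.0f * xy.real();
    s.V += 2.0f * xy.imag();
  }
  return out;
}
```

A reference of this kind is the sort of thing a correctness test can compare GPU output against, one TAB and channel at a time.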

The new kernel has been tested extensively for correctness. To this end, the test coverage of tCoherentStokesKernel was extended a bit.

Performance tests have been performed on an NVIDIA A4000 GPU, as well as on a GH200 GPU (in MIG mode) in a Grace Hopper system. In all cases, the kernel achieves more than 90% of the theoretical GPU memory bandwidth, indicating that the kernel is very close to optimal in terms of performance (it is highly memory bound).
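As an illustration of how such a bandwidth fraction can be estimated, the sketch below counts the bytes a memory-bound Stokes kernel must move and relates them to a measured runtime. The function names are hypothetical, and the element sizes are assumptions not stated in this description: complex float32 input for two polarizations (2 x 8 bytes per sample) and four float32 Stokes values per integrated output sample (4 x 4 bytes).

```cpp
#include <cassert>
#include <cstddef>

// Total bytes read and written by one kernel invocation.
// Assumed sizes (not taken from the merge request):
//   input : 2 polarizations x complex float32 (8 bytes) per sample
//   output: 4 Stokes parameters x float32 (4 bytes) per integrated sample
double bytesMoved(std::size_t nrChannels, std::size_t nrTabs,
                  std::size_t nrSamples, std::size_t factor) {
  assert(factor > 0);
  assert(nrSamples % factor == 0);
  const double perChannelTab = static_cast<double>(nrChannels) * nrTabs;
  const double bytesIn = perChannelTab * nrSamples * 2 * 8;
  const double bytesOut = perChannelTab * (nrSamples / factor) * 4 * 4;
  return bytesIn + bytesOut;
}

// Fraction of the peak memory bandwidth achieved for a measured runtime.
double bandwidthFraction(double bytes, double runtimeSeconds,
                         double peakBytesPerSecond) {
  assert(runtimeSeconds > 0.0);
  assert(peakBytesPerSecond > 0.0);
  return bytes / runtimeSeconds / peakBytesPerSecond;
}
```

Under these assumptions, `bytesMoved(1, 128, 196608, 32)` gives the traffic for 1 channel, 128 TABs and 196608 samples at an integration factor of 32. Dividing by the measured runtime and the GPU's peak bandwidth yields the achieved fraction. Because the input dominates the traffic, the estimate changes little for larger integration factors.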

The following plot compares the runtime on the A4000 for 1 frequency channel (a single channel is more challenging than multiple channels), 128 TABs and 196608 samples per channel:

[image: runtime comparison]

The new kernel outperforms the existing one by at least an order of magnitude in all cases, and by much more (> 64x) for time integration factors of 32-256. Consequently, it is also much more energy efficient:

[image: energy consumption comparison]

Again, the energy consumption is at least 10x lower.
