New Coherent Stokes Kernel

Bram Veenboer requested to merge new-coherent-stokes into cobalt2.1

The CUDA Coherent Stokes kernel was found to be a bottleneck in beamformer pipelines with large time integration factors (e.g. 16384x). A new kernel with a completely different vectorization and parallelization strategy removes this bottleneck. The kernel is designed for time integration factors of 32 or multiples thereof. Smaller time integration factors are handled by an alternative path in the kernel that uses a different strategy. The previous Coherent Stokes kernel relied on a fairly complex heuristic to find the optimal launch configuration. All of that code has been removed in favour of much simpler host code.
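For readers unfamiliar with the operation, the following is a minimal host-side C++ sketch of what a Coherent Stokes computation with time integration produces. It is a reference illustration only, not the CUDA kernel from this merge request. The names `Stokes`, `stokes_of` and `integrate_stokes` are hypothetical, the input is assumed to be one stream of complex X and Y polarization samples for a single TAB and channel, and the sign convention for U and V is an assumption.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical container for the four Stokes parameters.
struct Stokes { float I, Q, U, V; };

// Stokes parameters of a single dual-polarization sample.
// Assumed convention: U = 2 Re(X conj(Y)), V = 2 Im(X conj(Y)).
Stokes stokes_of(std::complex<float> x, std::complex<float> y) {
    const float px = std::norm(x);  // |X|^2
    const float py = std::norm(y);  // |Y|^2
    const std::complex<float> xy = x * std::conj(y);
    return {px + py, px - py, 2.0f * xy.real(), 2.0f * xy.imag()};
}

// Sum the Stokes parameters over blocks of `factor` consecutive samples.
// The number of samples must be a multiple of the integration factor.
std::vector<Stokes> integrate_stokes(const std::vector<std::complex<float>>& x,
                                     const std::vector<std::complex<float>>& y,
                                     std::size_t factor) {
    assert(factor > 0 && x.size() == y.size() && x.size() % factor == 0);
    std::vector<Stokes> out(x.size() / factor, Stokes{0.0f, 0.0f, 0.0f, 0.0f});
    for (std::size_t t = 0; t < x.size(); ++t) {
        const Stokes s = stokes_of(x[t], y[t]);
        Stokes& acc = out[t / factor];  // output sample this input contributes to
        acc.I += s.I;
        acc.Q += s.Q;
        acc.U += s.U;
        acc.V += s.V;
    }
    return out;
}
```

With a large integration factor, each output sample depends on many input samples, so the work per output grows while the number of outputs shrinks. That is the regime in which the parallelization strategy matters most.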

The new kernel has been tested extensively for correctness. To this end, the test coverage of tCoherentStokesKernel was extended a bit.

Performance tests have been performed on an NVIDIA RTX A4000 GPU, a GH200 GPU (in MIG mode) as found in a Grace Hopper system, and an NVIDIA RTX 4000 Ada GPU. In all cases, the kernel achieves more than 90% of the theoretical GPU memory bandwidth. Since the kernel is highly memory bound, this indicates that its performance is very close to optimal.
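As a rough illustration of how such a bandwidth fraction can be estimated, the sketch below counts the bytes a memory-bound Stokes kernel has to move and divides the achieved throughput by the peak bandwidth. The helper names `bytes_moved` and `bandwidth_fraction` are hypothetical, and the data layout is an assumption (two polarizations of complex `float` input per sample, `float` output per Stokes parameter). It is not the measurement method used for this merge request.

```cpp
#include <cstdint>

// Bytes read plus bytes written, assuming complex float input for two
// polarizations and n_stokes float outputs per integrated sample.
std::uint64_t bytes_moved(std::uint64_t tabs, std::uint64_t channels,
                          std::uint64_t samples, std::uint64_t factor,
                          std::uint64_t n_stokes) {
    const std::uint64_t input = tabs * channels * samples * 2 * 2 * sizeof(float);
    const std::uint64_t output = tabs * channels * (samples / factor) * n_stokes * sizeof(float);
    return input + output;
}

// Fraction of the peak memory bandwidth reached by a run that moved
// `bytes` bytes in `seconds` seconds.
double bandwidth_fraction(std::uint64_t bytes, double seconds, double peak_bytes_per_second) {
    return (static_cast<double>(bytes) / seconds) / peak_bytes_per_second;
}
```

For example, under these assumptions 128 TABs, 1 channel, 196608 samples, an integration factor of 32 and 4 Stokes parameters amount to roughly 415 MB of traffic per kernel invocation, nearly all of it input.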

The following plot compares the runtime for 1 frequency channel (a single channel is more challenging than multiple channels), 128 TABs and 196608 samples per channel on the RTX 4000 Ada: image

The new kernel outperforms the existing one by at least an order of magnitude in all cases. It is also much more energy efficient: image

Again, the energy consumption is at least 10x lower.

Edited by Bram Veenboer