Tune CUDA gridder and degridder kernels
Tune the CUDA gridder and degridder kernels with the help of the kernel tuner added in !193.
Many recent (and slightly older) GPU architectures have been tested. "ref" refers to the kernel in `master`, and "new" to the current state of this branch. The numbers are performance in GFLOPS, as reported by `cuda-generic.x`. The degridder kernel in particular got a lot faster on most architectures. The gridder kernel performance didn't change much; the differences appear to be within run-to-run variation.
device | gridder ref (GFLOPS) | gridder new (GFLOPS) | degridder ref (GFLOPS) | degridder new (GFLOPS) |
---|---|---|---|---|
Tesla K40 (Kepler, 3.5) | 1241.38 | 1373.98 | 505.07 | 931.36 |
Titan X (Maxwell, 5.2) | 3962.73 | 3875.11 | 3074.02 | 3263.03 |
Titan X (Pascal, 6.1) | 7205.79 | 7245.66 | 5926.93 | 6277.84 |
Tesla V100 (Volta, 7.0) | 11656.06 | 11694.43 | 9490.25 | 11326.06 |
Titan RTX (Turing, 7.5) | 13718.93 | 13557.87 | 9309.87 | 9363.47 |
A100 (Ampere, 8.0) | 15599.68 | 15316.14 | 14314.83 | 15641.83 |
RTX A4000 (Ampere, 8.6) | 9419.58 | 9513.16 | 8788.39 | 10871.11 |
Most importantly, we now have a simple and automated method of tuning the kernels for any (new) architecture and/or set of parameters.
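For illustration, this is roughly what such an automated sweep can look like, assuming a setup based on the Kernel Tuner Python package (whether the tuner from !193 uses that package is an assumption here). The `scale` kernel, the parameter names `NUM_THREADS`/`BATCH_SIZE`, and the value ranges are stand-ins, not the actual IDG gridder/degridder configuration.

```python
# Minimal sketch of an automated kernel parameter sweep, assuming a
# Kernel Tuner-style setup. The "scale" kernel and the parameter values
# below are stand-ins for illustration, not the IDG gridder/degridder.
import numpy as np
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void __launch_bounds__(NUM_THREADS)
scale(float *out, const float *in, int n) {
    // Each thread block processes NUM_THREADS * BATCH_SIZE elements.
    int base = blockIdx.x * NUM_THREADS * BATCH_SIZE + threadIdx.x;
    for (int i = 0; i < BATCH_SIZE; i++) {
        int idx = base + i * NUM_THREADS;
        if (idx < n) out[idx] = 2.0f * in[idx];
    }
}
"""

n = np.int32(1 << 22)
a_in = np.random.randn(n).astype(np.float32)
a_out = np.zeros_like(a_in)

# Candidate values; each combination is injected as #define's at compile time.
tune_params = {
    "NUM_THREADS": [64, 128, 256, 512],
    "BATCH_SIZE": [1, 2, 4, 8],
}

results, env = tune_kernel(
    "scale", kernel_string, int(n), [a_out, a_in, n], tune_params,
    block_size_names=["NUM_THREADS"],          # NUM_THREADS is the block size
    grid_div_x=["NUM_THREADS", "BATCH_SIZE"],  # grid.x = ceil(n / (NUM_THREADS * BATCH_SIZE))
)
print(min(results, key=lambda r: r["time"]))
```

Kernel Tuner compiles and benchmarks every candidate combination and reports the best-performing one, so rerunning the sweep on a new architecture or with a different parameter set is a single script invocation.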
Some observations:
- Having `BATCH_SIZE` > `NUM_THREADS` never seems to improve performance (see the sketch below).
- We need `__launch_bounds__` (as before) to get the best performance on older architectures.
- Despite the differences in architecture, it is striking that all architectures share the same optimal number of threads per block: 256 for the gridder kernel and 64 for the degridder kernel.