Tune CUDA gridder and degridder kernels
Tune the CUDA gridder and degridder kernels with the help of the kernel tuner added in !193.
Many recent (and slightly older) GPU architectures have been tested. "ref" refers to the kernel in `master`, and "new" to the current state of this branch. The numbers are performance in GFLOPS, as reported by `cuda-generic.x`. The degridder kernel in particular got a lot faster on most architectures. The gridder kernel performance didn't change much; the differences appear to be within run-to-run variation.
device | gridder ref (GFLOPS) | gridder new (GFLOPS) | degridder ref (GFLOPS) | degridder new (GFLOPS) |
---|---|---|---|---|
Tesla K40 (Kepler, 3.5) | 1241.38 | 1373.98 | 505.07 | 931.36 |
Titan X (Maxwell, 5.2) | 3962.73 | 3875.11 | 3074.02 | 3263.03 |
Titan X (Pascal, 6.1) | 7205.79 | 7245.66 | 5926.93 | 6277.84 |
Tesla V100 (Volta, 7.0) | 11656.06 | 11694.43 | 9490.25 | 11326.06 |
Titan RTX (Turing, 7.5) | 13718.93 | 13557.87 | 9309.87 | 9363.47 |
A100 (Ampere, 8.0) | 15599.68 | 15316.14 | 14314.83 | 15641.83 |
RTX A4000 (Ampere, 8.6) | 9419.58 | 9513.16 | 8788.39 | 10871.11 |
Most importantly, we now have a simple and automated method of tuning the kernels for any (new) architecture and/or set of parameters.
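For illustration, this is roughly what such an automated sweep can look like, assuming a setup based on the Kernel Tuner Python package (whether the tuner from !193 uses that package is an assumption here). The `scale` kernel, the parameter names `NUM_THREADS`/`BATCH_SIZE`, and the value ranges are stand-ins, not the actual IDG gridder/degridder configuration.

```python
# Minimal sketch of an automated kernel parameter sweep, assuming a
# Kernel Tuner-style setup. The "scale" kernel and the parameter values
# below are stand-ins for illustration, not the IDG gridder/degridder.
import numpy as np
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void __launch_bounds__(NUM_THREADS)
scale(float *out, const float *in, int n) {
    // Each thread block processes NUM_THREADS * BATCH_SIZE elements.
    int base = blockIdx.x * NUM_THREADS * BATCH_SIZE + threadIdx.x;
    for (int i = 0; i < BATCH_SIZE; i++) {
        int idx = base + i * NUM_THREADS;
        if (idx < n) out[idx] = 2.0f * in[idx];
    }
}
"""

n = np.int32(1 << 22)
a_in = np.random.randn(n).astype(np.float32)
a_out = np.zeros_like(a_in)

# Candidate values; each combination is injected as #define's at compile time.
tune_params = {
    "NUM_THREADS": [64, 128, 256, 512],
    "BATCH_SIZE": [1, 2, 4, 8],
}

results, env = tune_kernel(
    "scale", kernel_string, int(n), [a_out, a_in, n], tune_params,
    block_size_names=["NUM_THREADS"],          # NUM_THREADS is the block size
    grid_div_x=["NUM_THREADS", "BATCH_SIZE"],  # grid.x = ceil(n / (NUM_THREADS * BATCH_SIZE))
)
print(min(results, key=lambda r: r["time"]))
```

Kernel Tuner compiles and benchmarks every candidate combination and reports the best-performing one, so rerunning the sweep on a new architecture or with a different parameter set is a single script invocation.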
Some observations:
- Having `BATCH_SIZE` > `NUM_THREADS` never seems to improve performance (see the sketch below).
- We need `__launch_bounds__` (as before) to get the best performance on older architectures.
- Despite the differences in architecture, it is striking that all architectures share the same optimal number of threads per block: 256 for the gridder kernel and 64 for the degridder kernel.