Improve performance of CUDA gridder and degridder with many aterms
The CUDA gridder and degridder kernels have been optimized for the case where multiple aterms need to be applied to a single subgrid. Moreover, the mechanism for selecting batch and block size parameters in InstanceCUDA is outdated. Since the new kernels have been tested only on an RTX A4000 (ga102), it does not make much sense to keep the parameters for other GPUs. If these changes cause performance regressions on older architectures, those will need to be addressed separately.
The performance is measured using a (synthetic) benchmark separate from this repository: