Optimize correlator
The correlator code, originally added to the LOFAR (now Cobalt) codebase over a decade ago, has remained largely unchanged aside from minor maintenance. Since then, the NVIDIA compiler has significantly improved its ability to optimise code automatically, reducing the need for manual optimisations, such as loop unrolling.
This MR introduces a substantial cleanup and simplification of the kernel code. Key changes include:
- Introducing new helper functions: `load_samples`, `do_correlate`, and `compute_do_baseline`.
- Replacing the separate `correlate_1x1` to `correlate_4x4` functions with a single, templated `correlate_nxn` function.
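To illustrate the idea behind the templated function, here is a minimal sketch in plain C++ (not CUDA, so it runs without a GPU). It is not the actual kernel code: the signature, the `Sample` alias, and the data layout are assumptions made for illustration. Only the name `correlate_nxn` comes from this MR. The point is that with the tile size as a template parameter, the inner loops have compile-time trip counts, so the compiler can unroll them without hand-written variants.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

// Assumed sample type: one complex value per station and time step.
using Sample = std::complex<float>;

// Hypothetical host-side stand-in for a templated correlator tile.
// samples[station][time] holds the data; the function accumulates the
// N x N visibilities between stations first_a.. and first_b..
template <std::size_t N>
std::array<std::array<Sample, N>, N> correlate_nxn(
    const std::vector<std::vector<Sample>>& samples,
    std::size_t first_a, std::size_t first_b, std::size_t n_times)
{
    std::array<std::array<Sample, N>, N> vis{};  // zero-initialised accumulators

    for (std::size_t t = 0; t < n_times; ++t) {
        // N is a compile-time constant, so both loops have fixed trip
        // counts and the compiler is free to unroll them itself.
        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                vis[i][j] += samples[first_a + i][t] *
                             std::conj(samples[first_b + j][t]);
            }
        }
    }
    return vis;
}
```

In this sketch, `correlate_nxn<1>` and `correlate_nxn<4>` would play the roles of the former `correlate_1x1` and `correlate_4x4` functions.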
As shown in the figures below, the performance on an NVIDIA Tesla V100 remains virtually unchanged between the original (‘reference’) and updated kernel implementations.
The `tCorrelatorPerformance` test previously ran only when a Tesla K10 GPU was detected; it now runs on a Tesla V100 instead. The runtimes measured with the newest correlator kernel were put in as the reference runtimes. Applying only this change (updating `tCorrelatorPerformance` but keeping the existing correlator kernel), we see some interesting speedups of the optimised kernel version:
Failure in 48_Stations_250ms_16ch: Expected 2.5 +/- 0.5 but was 5.37528
Failure in 80_Stations_250ms_16ch: Expected 3.6 +/- 0.5 but was 10.1913
In other words, for these cases the new version is roughly 2.2x and 2.8x faster, respectively.
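The speedups follow directly from the two failure messages. The sketch below reproduces that arithmetic. Both helper names are hypothetical and not part of `tCorrelatorPerformance`, and the tolerance check is only an assumption about how an "Expected X +/- T" comparison might be evaluated.

```cpp
#include <cmath>

// Hypothetical helper mirroring an "Expected X +/- T but was Y" check:
// true when the measured runtime lies within the tolerance band.
bool within_tolerance(double expected, double tolerance, double measured)
{
    return std::fabs(measured - expected) <= tolerance;
}

// Ratio of the existing kernel's runtime to the new reference runtime.
double speedup(double old_runtime, double new_runtime)
{
    return old_runtime / new_runtime;
}
```

With the values reported above, `speedup(5.37528, 2.5)` is about 2.15 for 48 stations and `speedup(10.1913, 3.6)` is about 2.83 for 80 stations.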