Add support for extreme cases in QuantizeOutput kernel

Based on pipeline-buffers.txt, it was assumed that the CoherentStokes or IncoherentStokes kernel never produces more than 12288 samples per visibility. However, this does not hold in 'extreme cases' (according to Sarod and Cees). The updated kernel keeps the same shared memory buffer (for at most 12288 samples) and iterates over the input in batches when the input is larger than the batch size. The 'two-pass' implementation is removed in the process.
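
The batching scheme can be illustrated with a minimal host-side C++ sketch. It is not the kernel code: the names `kBatchSize`, `BatchResult` and `process_in_batches` are hypothetical, a `std::vector` stands in for the shared memory buffer, and the running sum is only a placeholder for whatever the kernel computes per batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical capacity, mirroring the 12288-sample shared memory buffer.
constexpr std::size_t kBatchSize = 12288;

struct BatchResult {
    std::size_t num_batches;  // how many times the buffer was (re)filled
    double sum;               // placeholder for the per-batch computation
};

// Walk over the input in chunks of at most batch_size samples, staging
// each chunk in a fixed-size buffer before processing it.
BatchResult process_in_batches(const std::vector<float>& input,
                               std::size_t batch_size = kBatchSize) {
    assert(batch_size > 0);
    std::vector<float> buffer(batch_size);  // stand-in for shared memory
    BatchResult result{0, 0.0};

    for (std::size_t offset = 0; offset < input.size(); offset += batch_size) {
        // The last batch may be smaller than batch_size.
        const std::size_t count = std::min(batch_size, input.size() - offset);
        std::copy_n(input.data() + offset, count, buffer.begin());

        for (std::size_t i = 0; i < count; ++i) {
            result.sum += buffer[i];
        }
        ++result.num_batches;
    }
    return result;
}
```

With this layout, an input that fits in the buffer is handled in a single batch, so the common case costs no more than before; only larger inputs take additional iterations.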
Furthermore, with a large number of channels (e.g. 512) and all four Stokes parameters, the number of supported TABs was limited to 32, because CUDA does not support grids larger than 65536 blocks. As a workaround, thread blocks are now assigned to Stokes parameters and TABs (not to channels), and each block iterates over all channels.
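
The effect on the grid size can be checked with a small host-side C++ sketch. It assumes the old layout used one thread block per (channel, Stokes, TAB) combination, which is consistent with 512 x 4 x 32 = 65536 but is not confirmed by the kernel source; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed old layout: one block per (channel, Stokes, TAB) combination.
std::size_t blocks_channel_layout(std::size_t nr_channels,
                                  std::size_t nr_stokes,
                                  std::size_t nr_tabs) {
    return nr_channels * nr_stokes * nr_tabs;
}

// New layout: one block per (Stokes, TAB) pair, independent of channels.
std::size_t blocks_stokes_tab_layout(std::size_t nr_stokes,
                                     std::size_t nr_tabs) {
    assert(nr_stokes > 0 && nr_tabs > 0);
    return nr_stokes * nr_tabs;
}

// Simulate the new layout: the two outer loops enumerate the blocks,
// the inner loop is the per-block iteration over all channels.
// Returns the number of (channel, Stokes, TAB) items visited.
std::size_t visit_all_items(std::size_t nr_channels,
                            std::size_t nr_stokes,
                            std::size_t nr_tabs) {
    std::size_t visited = 0;
    for (std::size_t tab = 0; tab < nr_tabs; ++tab) {
        for (std::size_t stokes = 0; stokes < nr_stokes; ++stokes) {
            for (std::size_t channel = 0; channel < nr_channels; ++channel) {
                ++visited;  // placeholder for the per-channel work
            }
        }
    }
    return visited;
}
```

For 512 channels, four Stokes parameters and 32 TABs, the assumed old layout needs 65536 blocks while the new one needs 128, and the same number of items is still visited because each block loops over the channels itself.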