Pre-allocate memory for transposed input
The BeamFormerCCGKernel now allocates the cu::DeviceMemory for the transposed input once, in its constructor, instead of allocating and freeing it asynchronously on demand. The asynchronous scheme turned out to be surprisingly slow, so this change improves overall throughput at the cost of keeping some device memory allocated for the lifetime of the kernel object.
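
To illustrate the trade-off, here is a minimal host-only sketch, not the actual BeamFormerCCGKernel code. MockDeviceMemory, OnDemandKernel, PreallocatedKernel and count_allocations are hypothetical names invented for this example. The mock uses host memory and an allocation counter as a stand-in for cu::DeviceMemory, so it only shows how many allocations each scheme performs, not the real cost on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cu::DeviceMemory: host memory plus a counter,
// so the two schemes can be compared without a GPU.
struct MockDeviceMemory {
  explicit MockDeviceMemory(std::size_t bytes) : storage(bytes) {
    ++allocation_count();
  }
  // Number of buffers created so far.
  static std::size_t& allocation_count() {
    static std::size_t count = 0;
    return count;
  }
  std::vector<unsigned char> storage;
};

// Old scheme (sketched): a fresh buffer for the transposed input per launch.
class OnDemandKernel {
 public:
  explicit OnDemandKernel(std::size_t bytes) : bytes_(bytes) {}
  void run() {
    MockDeviceMemory transposed(bytes_);  // allocated here, freed at end of scope
    assert(transposed.storage.size() == bytes_);
    // ... transpose the input into `transposed` and run the beamformer ...
  }

 private:
  std::size_t bytes_;
};

// New scheme (sketched): one buffer owned by the kernel object for its lifetime.
class PreallocatedKernel {
 public:
  explicit PreallocatedKernel(std::size_t bytes) : transposed_(bytes) {}
  void run() {
    assert(!transposed_.storage.empty());
    // ... transpose the input into `transposed_` and run the beamformer ...
  }

 private:
  MockDeviceMemory transposed_;
};

// Allocations caused by constructing a kernel and launching it `launches` times.
template <typename Kernel>
std::size_t count_allocations(std::size_t bytes, int launches) {
  const std::size_t before = MockDeviceMemory::allocation_count();
  Kernel kernel(bytes);
  for (int i = 0; i < launches; ++i) kernel.run();
  return MockDeviceMemory::allocation_count() - before;
}
```

In this sketch, count_allocations<OnDemandKernel>(1024, 100) returns 100, while count_allocations<PreallocatedKernel>(1024, 100) returns 1. The pre-allocated buffer stays resident until the kernel object is destroyed, which is the memory cost described above.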