Move `reinterpret_cast` out of loop to improve performance for fp16x2 kernel (!9) · Merge requests · ResearchAndDevelopment / cudapeak

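This MR improves the performance of the fp16x2 kernel by moving the `reinterpret_cast` outside the inner loop. On an NVIDIA RTX A4000, fp16x2 throughput rises from 13.18 to 22.11 TOps/s (about 1.68x), while the fp16 kernel is unchanged.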
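
To illustrate the general pattern, here is a minimal sketch, not the actual cudapeak kernel. The kernel names, the loop count `kIterations`, and the helper `hoisted_matches_inside` are all hypothetical. It contrasts a loop that re-derives a `half2` pointer and goes through memory on every iteration with one that casts once and keeps the accumulator in a register. The cast itself emits no instructions, so any gain in practice would come from what the compiler can keep in registers; whether these two variants differ in speed depends on the compiler and the surrounding code.

```cuda
// Build (assumption): nvcc -arch=sm_86 hoist_cast.cu
// __hfma2 needs a GPU with compute capability 5.3 or newer.
#include <cuda_fp16.h>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr int kIterations = 256;  // illustrative loop count, not from the MR

// Variant A: the pointer cast and the memory traffic sit inside the loop.
__global__ void fma_cast_inside(half *data, int n_pairs) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n_pairs) return;
  const half2 a = __float2half2_rn(0.5f);
  const half2 b = __float2half2_rn(1.0f);
  for (int k = 0; k < kIterations; ++k) {
    half2 *p = reinterpret_cast<half2 *>(data);  // re-derived every iteration
    p[i] = __hfma2(a, p[i], b);                  // x = a * x + b on both lanes
  }
}

// Variant B: cast once, accumulate in a register, store once at the end.
__global__ void fma_cast_hoisted(half *data, int n_pairs) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n_pairs) return;
  const half2 a = __float2half2_rn(0.5f);
  const half2 b = __float2half2_rn(1.0f);
  half2 *p = reinterpret_cast<half2 *>(data);    // hoisted out of the loop
  half2 x = p[i];
  for (int k = 0; k < kIterations; ++k) {
    x = __hfma2(a, x, b);
  }
  p[i] = x;
}

// Runs both variants on identical input; true if the outputs are bit-identical.
bool hoisted_matches_inside() {
  const int n_pairs = 1024;
  const int n = 2 * n_pairs;
  const size_t bytes = n * sizeof(half);
  std::vector<half> input(n), out_a(n), out_b(n);
  for (int j = 0; j < n; ++j) input[j] = __float2half(static_cast<float>(j % 8));

  half *d_a = nullptr, *d_b = nullptr;
  if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
  if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return false; }
  cudaMemcpy(d_a, input.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, input.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 256;
  const int blocks = (n_pairs + threads - 1) / threads;
  fma_cast_inside<<<blocks, threads>>>(d_a, n_pairs);
  fma_cast_hoisted<<<blocks, threads>>>(d_b, n_pairs);

  // The blocking copies on the default stream wait for both kernels to finish.
  bool ok = cudaMemcpy(out_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
  ok = ok && cudaMemcpy(out_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
  cudaFree(d_a);
  cudaFree(d_b);
  return ok && std::memcmp(out_a.data(), out_b.data(), bytes) == 0;
}

int main() {
  std::printf("variants agree: %s\n", hoisted_matches_inside() ? "yes" : "no");
  return 0;
}
```

Both variants apply the same sequence of `__hfma2` operations to each element, so the sketch only checks that hoisting preserves the result; timing it against a real workload is left to the benchmark itself.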

Before:

Device 0: NVIDIA RTX A4000 (48SMs, 1.56 Ghz)
                fp16:  283.14 ms,   11.65 TOps/s
              fp16x2:  250.22 ms,   13.18 TOps/s

After:

Device 0: NVIDIA RTX A4000 (48SMs, 1.56 Ghz)
                fp16:  283.08 ms,   11.65 TOps/s
              fp16x2:  149.21 ms,   22.11 TOps/s
