Move reinterpret_cast out of loop to improve performance for fp16x2 kernel
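This MR improves the performance of the fp16x2 kernel by moving the
reinterpret_cast outside the inner loop. On an RTX A4000, the fp16x2 run drops
from 250.22 ms to 149.21 ms (13.18 to 22.11 TOps/s, roughly 1.7x); the fp16
kernel is unaffected.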
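For readers without the diff at hand, the shape of the change can be sketched as below. This is only an illustration, not the kernel from this MR: the kernel names (fma_cast_inside, fma_cast_hoisted), the FMA body, the constants, and the launch sizes are all assumptions chosen to keep the example small. It assumes a GPU with native half arithmetic (compute capability 5.3 or newer) and a buffer aligned for __half2, which cudaMalloc provides.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_fp16.h>

// Hypothetical "before" shape: the __half buffer is reinterpreted as
// __half2 on every iteration of the inner loop.
__global__ void fma_cast_inside(__half* data, int n_pairs, int iters)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pairs) return;
    const __half2 a = __float2half2_rn(0.5f);
    const __half2 b = __float2half2_rn(1.0f);
    for (int it = 0; it < iters; ++it) {
        __half2* v = reinterpret_cast<__half2*>(data);  // cast inside the loop
        v[i] = __hfma2(v[i], a, b);
    }
}

// Hypothetical "after" shape: the cast is done once, before the loop.
__global__ void fma_cast_hoisted(__half* data, int n_pairs, int iters)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pairs) return;
    const __half2 a = __float2half2_rn(0.5f);
    const __half2 b = __float2half2_rn(1.0f);
    __half2* v = reinterpret_cast<__half2*>(data);      // cast hoisted out
    for (int it = 0; it < iters; ++it) {
        v[i] = __hfma2(v[i], a, b);
    }
}

int main()
{
    // Illustrative sizes only; not the benchmark configuration.
    const int n_pairs = 1 << 20, iters = 64, threads = 256;
    const int blocks = (n_pairs + threads - 1) / threads;
    const size_t bytes = static_cast<size_t>(n_pairs) * sizeof(__half2);

    __half *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);  // all-zero bits are 0.0 in fp16
    cudaMemset(d_b, 0, bytes);

    fma_cast_inside<<<blocks, threads>>>(d_a, n_pairs, iters);
    fma_cast_hoisted<<<blocks, threads>>>(d_b, n_pairs, iters);
    cudaDeviceSynchronize();

    // Both variants perform the same arithmetic, so compare the raw bytes.
    std::vector<unsigned char> h_a(bytes), h_b(bytes);
    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    const bool same = std::memcmp(h_a.data(), h_b.data(), bytes) == 0;
    std::printf("%s\n", same ? "results match" : "results differ");

    cudaFree(d_a);
    cudaFree(d_b);
    return same ? 0 : 1;
}
```

The two variants are meant to be functionally identical; only where the cast sits differs. In a toy kernel this small the compiler may well generate the same code for both, so the sketch shows the pattern rather than reproducing the speedup measured below.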
Before:
Device 0: NVIDIA RTX A4000 (48SMs, 1.56 Ghz)
fp16: 283.14 ms, 11.65 TOps/s
fp16x2: 250.22 ms, 13.18 TOps/s
After:
Device 0: NVIDIA RTX A4000 (48SMs, 1.56 Ghz)
fp16: 283.08 ms, 11.65 TOps/s
fp16x2: 149.21 ms, 22.11 TOps/s