float16 kernel with col-major C matrix fails with different tuning parameters
Running on the A100, all tests pass with the current tuning parameters:
.m_per_block = 128,
.m_per_warp = 128,
.m_per_wmma = 16,
.n_per_block = 64,
.n_per_warp = 16,
.n_per_wmma = 16,
.k_per_wmma = 16,
.nbuffer = 4
The C col-major tests fail when I change the tuning parameters to these values:
.m_per_block = 256,
.m_per_warp = 64,
.m_per_wmma = 16,
.n_per_block = 32,
.n_per_warp = 32,
.n_per_wmma = 16,
.k_per_wmma = 16,
.nbuffer = 2
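
To compare the two sets, here is a minimal sketch of the tiling each one implies. It assumes the block is split evenly into warps and each warp tile is split evenly into WMMA tiles. The struct name `TuningParams`, the `Tiling` struct, and the helper functions are hypothetical; only the field names and values come from the lists above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical mirror of the tuning parameters; field order matches the lists above.
struct TuningParams {
    int m_per_block, m_per_warp, m_per_wmma;
    int n_per_block, n_per_warp, n_per_wmma;
    int k_per_wmma, nbuffer;
};

// Derived counts (assumption: each tile level evenly divides the level above it).
struct Tiling {
    int warps_m, warps_n;  // warps per block along M and N
    int wmma_m, wmma_n;    // WMMA tiles per warp along M and N
};

constexpr bool divides_evenly(const TuningParams& p) {
    return p.m_per_block % p.m_per_warp == 0 && p.m_per_warp % p.m_per_wmma == 0 &&
           p.n_per_block % p.n_per_warp == 0 && p.n_per_warp % p.n_per_wmma == 0;
}

constexpr Tiling derive(const TuningParams& p) {
    return Tiling{p.m_per_block / p.m_per_warp, p.n_per_block / p.n_per_warp,
                  p.m_per_warp / p.m_per_wmma, p.n_per_warp / p.n_per_wmma};
}

// The two parameter sets from this report.
constexpr TuningParams kPassing{128, 128, 16, 64, 16, 16, 16, 4};
constexpr TuningParams kFailing{256, 64, 16, 32, 32, 16, 16, 2};

static_assert(divides_evenly(kPassing), "passing set must tile evenly");
static_assert(divides_evenly(kFailing), "failing set must tile evenly");

// Print the derived tiling for one parameter set.
void print_tiling(const char* name, const TuningParams& p) {
    assert(divides_evenly(p));
    const Tiling t = derive(p);
    std::printf("%s: warps %dx%d, wmma tiles per warp %dx%d\n",
                name, t.warps_m, t.warps_n, t.wmma_m, t.wmma_n);
}
```

Under that assumption, the passing set has 1 warp along M and 1 WMMA tile along N per warp, while the failing set has 4 warps along M and 2 WMMA tiles along N per warp. So the failing set may be exercising col-major C store indexing that the current parameters never reach. That is only a guess at where to look, not a confirmed cause.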