float16 kernel with col-major C matrix fails with different tuning parameters

Running on the A100, all tests pass with the current tuning parameters:

    .m_per_block = 128,
    .m_per_warp = 128,
    .m_per_wmma = 16, 

    .n_per_block = 64, 
    .n_per_warp = 16, 
    .n_per_wmma = 16, 

    .k_per_wmma = 16, 

    .nbuffer = 4 

The col-major C tests fail when I change the tuning parameters to these values:

    .m_per_block = 256,
    .m_per_warp = 64, 
    .m_per_wmma = 16, 

    .n_per_block = 32, 
    .n_per_warp = 32, 
    .n_per_wmma = 16, 

    .k_per_wmma = 16, 

    .nbuffer = 2
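
To help narrow this down, here is a rough sketch that derives the per-block and per-warp tile counts for both configurations. The `Tuning` struct and the helper functions are hypothetical stand-ins, not the kernel's actual types, and the block -> warp -> WMMA tile decomposition is my assumption about how the parameters are meant to be read.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical mirror of the tuning parameters quoted above; the real
// kernel's struct may be laid out differently.
struct Tuning {
    int m_per_block, m_per_warp, m_per_wmma;
    int n_per_block, n_per_warp, n_per_wmma;
    int k_per_wmma;
    int nbuffer;
};

// Assumed decomposition: a block tile is split into warp tiles,
// and each warp tile is split into WMMA tiles.
constexpr int warps_m(const Tuning& t) { return t.m_per_block / t.m_per_warp; }
constexpr int warps_n(const Tuning& t) { return t.n_per_block / t.n_per_warp; }
constexpr int tiles_m(const Tuning& t) { return t.m_per_warp / t.m_per_wmma; }
constexpr int tiles_n(const Tuning& t) { return t.n_per_warp / t.n_per_wmma; }
constexpr int total_warps(const Tuning& t) { return warps_m(t) * warps_n(t); }
constexpr int tiles_per_warp(const Tuning& t) { return tiles_m(t) * tiles_n(t); }

// True when every level of the tiling divides the level above it exactly.
constexpr bool divides_evenly(const Tuning& t) {
    return t.m_per_block % t.m_per_warp == 0 && t.m_per_warp % t.m_per_wmma == 0 &&
           t.n_per_block % t.n_per_warp == 0 && t.n_per_warp % t.n_per_wmma == 0;
}

// Field order: m_per_block, m_per_warp, m_per_wmma,
//              n_per_block, n_per_warp, n_per_wmma, k_per_wmma, nbuffer
constexpr Tuning passing{128, 128, 16, 64, 16, 16, 16, 4};
constexpr Tuning failing{256, 64, 16, 32, 32, 16, 16, 2};

static_assert(divides_evenly(passing), "passing config must tile evenly");
static_assert(divides_evenly(failing), "failing config must tile evenly");

// Print the derived layout for one configuration.
inline void describe(const char* name, const Tuning& t) {
    assert(divides_evenly(t));
    std::printf("%s: warps %dx%d (%d total), wmma tiles per warp %dx%d, nbuffer %d\n",
                name, warps_m(t), warps_n(t), total_warps(t),
                tiles_m(t), tiles_n(t), t.nbuffer);
}
```

Under these assumptions both configurations use 4 warps per block and 8 WMMA tiles per warp, so the total amount of work per warp is unchanged. What differs is the shape: the passing configuration has 1x4 warps with 8x1 tiles per warp, while the failing one has 4x1 warps with 4x2 tiles per warp, and `nbuffer` drops from 4 to 2. This does not identify the bug, but it suggests the col-major C path may only have been exercised with a single warp along M and a single WMMA tile along N per warp.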