
Edge case handling: what if M_GLOBAL, N_GLOBAL, or K_GLOBAL is smaller than the per-block or per-wmma M, N, K?

We could lower the per-block and per-warp values, down to the minimum that the wmma level supports. If the matrix is smaller than a single wmma fragment, padding is required.
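A minimal host-side sketch of that clamping, under a few assumptions: the fragment shape is 16x16x16, the requested per-block sizes are already multiples of the fragment size, and the names `WMMA_M`, `WMMA_N`, `WMMA_K`, `round_up_to`, and `clamp_tile` are hypothetical, not taken from the existing code.

```cpp
#include <algorithm>

// Assumed wmma fragment shape (hypothetical constants for illustration).
constexpr int WMMA_M = 16;
constexpr int WMMA_N = 16;
constexpr int WMMA_K = 16;

// Round x up to the next multiple of `multiple`.
// This gives the padded extent of a dimension smaller than one fragment.
constexpr int round_up_to(int x, int multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

// Shrink a requested per-block tile size so that it does not exceed the
// padded global extent, but never drops below one wmma fragment.
// Assumes `requested` is a multiple of `wmma`.
inline int clamp_tile(int requested, int global, int wmma) {
    const int padded = round_up_to(global, wmma);
    return std::max(wmma, std::min(requested, padded));
}
```

For example, with M_GLOBAL = 8 and a requested block size of 128, `clamp_tile(128, 8, WMMA_M)` gives 16, and the matrix would be padded from 8 to 16 rows. The sketch only pads to the fragment size; it does not make the padded extent a multiple of the block size.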

NBUFFER should also be capped at K_GLOBAL/K_PER_WMMA, i.e. at the number of wmma steps along K.
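The NBUFFER cap could be sketched like this. The helper names `ceil_div` and `clamp_nbuffer` are hypothetical, and the division is rounded up on the assumption that K is padded to a multiple of K_PER_WMMA.

```cpp
#include <algorithm>

// Integer division rounded up.
constexpr int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

// Cap the buffer count at the number of wmma steps along K.
// Assumes K_GLOBAL is padded up to a multiple of k_per_wmma, so there is
// always at least one step; the result is never less than 1.
inline int clamp_nbuffer(int nbuffer, int k_global, int k_per_wmma) {
    const int k_steps = ceil_div(k_global, k_per_wmma);
    return std::max(1, std::min(nbuffer, k_steps));
}
```

With K_GLOBAL = 20 and K_PER_WMMA = 16, K pads to 32, which is two steps, so a requested NBUFFER of 4 would drop to 2.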