Edge case handling: what if M/N/K_GLOBAL are smaller than the M/N/K per block or per wmma fragment?
We could lower the per-block and per-warp values, with the wmma fragment size as the lower bound. If the matrix is smaller than a single wmma fragment, padding is required.
NBUFFER should also be at most K_GLOBAL/K_PER_WMMA.
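
A rough host-side sketch of the clamping described above. All names here (TileConfig, round_up, clamp_config, the *_padded fields, and the parameter spellings) are hypothetical and not taken from the kernel. It assumes the per-block sizes are multiples of the wmma fragment sizes, that padding rounds each global dimension up to a whole number of fragments, and that NBUFFER must stay at least 1 even when K_GLOBAL is smaller than K_PER_WMMA.

```cpp
#include <algorithm>

// Hypothetical container for the adjusted launch parameters.
struct TileConfig {
    int m_per_block, n_per_block, k_per_block;  // clamped block tile sizes
    int nbuffer;                                // clamped buffer count
    int m_padded, n_padded, k_padded;           // global sizes after padding
};

// Round x up to the next multiple of `multiple` (both assumed positive).
constexpr int round_up(int x, int multiple) {
    return (x + multiple - 1) / multiple * multiple;
}

inline TileConfig clamp_config(int m_global, int n_global, int k_global,
                               int m_per_block, int n_per_block, int k_per_block,
                               int m_per_wmma, int n_per_wmma, int k_per_wmma,
                               int nbuffer) {
    TileConfig c{};

    // Pad each global dimension to a whole number of wmma fragments, so a
    // matrix smaller than one fragment still fills exactly one fragment.
    c.m_padded = round_up(m_global, m_per_wmma);
    c.n_padded = round_up(n_global, n_per_wmma);
    c.k_padded = round_up(k_global, k_per_wmma);

    // Shrink the block tile to the padded problem size, but never below
    // one wmma fragment.
    c.m_per_block = std::max(m_per_wmma, std::min(m_per_block, c.m_padded));
    c.n_per_block = std::max(n_per_wmma, std::min(n_per_block, c.n_padded));
    c.k_per_block = std::max(k_per_wmma, std::min(k_per_block, c.k_padded));

    // NBUFFER is capped by the number of wmma steps along K. The padded K
    // is used here (an assumption) so the cap never drops to zero.
    c.nbuffer = std::max(1, std::min(nbuffer, c.k_padded / k_per_wmma));

    return c;
}
```

For example, with 16x16x16 fragments, a requested 128x128x64 block tile, NBUFFER = 4, and a global problem of 8x40x32, this would give a 16x48x32 block tile, M padded to 16, N padded to 48, and NBUFFER lowered to 2.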