Draft: AMD support

Leon Oostrum requested to merge amd-support into master

Add support for AMD GPUs with matrix cores: RDNA3 (gfx11) and CDNA1/2/3 (gfx9)

  • 4-bit mode is not supported. It is unavailable on CDNA, but could be supported with intrinsics on RDNA3
  • 8-bit mode is not supported on CDNA3, which no longer has a 16x16x16 fragment size, only 16x16x32 or 32x32x8
  • The functions that detect the NVRTC include path and the GPU architecture were removed from the correlator; both are now provided by cudawrappers
  • CorrelatorTest was expanded with some powers of two, as well as some parameter combinations that originally failed on AMD while other combinations passed
  • PMT is supported through its ROCm sensor
  • Directly writing the visibilities from registers to device memory is supported, but performance is typically lower than in portable mode. The cause is the different fragment layout on AMD and NVIDIA: on AMD, one thread of a warp has access to different rows of the same column of the fragment matrix, while on NVIDIA, one thread has access to different columns of the same row. As a result, one thread on NVIDIA holds both the real and imaginary part of a few visibilities, while one thread on AMD holds either the real or the imaginary part. Device functions were added to store one half of a visibility.
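
The layout difference described in the last item can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the kernel code from this merge request: the 16x16 fragment, the 32 threads with 8 elements each, the two index mappings (`nvidia_like`, `amd_like`) and the convention that even columns hold real parts and odd columns hold imaginary parts are all simplifications chosen for illustration.

```cpp
#include <cassert>

// Hypothetical model: a 16x16 accumulator fragment spread over 32 threads,
// 8 elements per thread. The mappings below are illustrative assumptions,
// not a statement of the exact hardware layout.
constexpr int kThreads = 32;
constexpr int kElemsPerThread = 8;

struct Coord {
  int row;
  int col;
};

// NVIDIA-like model: a thread owns pairs of adjacent columns in the same row.
inline Coord nvidia_like(int lane, int i) {
  const int pair = i / 2;  // four pairs of two adjacent columns
  const int row = lane / 4 + 8 * (pair % 2);
  const int col = 2 * (lane % 4) + 8 * (pair / 2) + i % 2;
  return Coord{row, col};
}

// AMD-like model: a thread owns different rows of a single column.
inline Coord amd_like(int lane, int i) {
  return Coord{2 * i + lane / 16, lane % 16};
}

using Layout = Coord (*)(int lane, int i);

// Assumption: even columns are real parts, odd columns are imaginary parts.
// Returns true if the thread holds at least one of each.
inline bool holds_real_and_imag(Layout layout, int lane) {
  bool has_real = false;
  bool has_imag = false;
  for (int i = 0; i < kElemsPerThread; ++i) {
    if (layout(lane, i).col % 2 == 0) {
      has_real = true;
    } else {
      has_imag = true;
    }
  }
  return has_real && has_imag;
}
```

In this model every thread of the NVIDIA-like layout can write complete visibilities on its own, while every thread of the AMD-like layout can only write one half, which is why a separate store path for half visibilities is needed.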

In progress:

  • Directly writing the visibilities from registers has a not-yet-identified bug, at least in 16-bit mode: the results are wrong with, for example, 33 receivers and 32 receivers per block. A workaround is to add an extra __syncthreads() after writing each (half) visibility. The bug does not show up in portable mode, and I have not been able to reproduce it in 8-bit mode either.
  • Some work has been done to support wave64 mode (required for CDNA GPUs). The host side should be complete, but the kernel still needs to be checked thoroughly: some combinations of parameters give the right results, but many do not.
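
The restrictions listed in this description can be summarised as a small host-side lookup. The sketch below is illustrative only and is not code from this merge request: the names `Family`, `family_from_arch`, `supports_bits` and `wavefront_size` are hypothetical, the mapping of gfx908, gfx90a and gfx94x to CDNA1, CDNA2 and CDNA3 is an assumption not stated above, and wave32 on RDNA3 is assumed to be the mode in use.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; this is not the correlator's API.
enum class Family { RDNA3, CDNA1, CDNA2, CDNA3, Unknown };

// True if `arch` starts with `prefix` (arch strings may carry feature suffixes).
inline bool starts_with(const std::string& arch, const std::string& prefix) {
  return arch.compare(0, prefix.size(), prefix) == 0;
}

// Assumed mapping from architecture name to GPU family.
inline Family family_from_arch(const std::string& arch) {
  if (starts_with(arch, "gfx11")) return Family::RDNA3;
  if (starts_with(arch, "gfx908")) return Family::CDNA1;
  if (starts_with(arch, "gfx90a")) return Family::CDNA2;
  if (starts_with(arch, "gfx94")) return Family::CDNA3;
  return Family::Unknown;
}

// Sample sizes per the description: no 4-bit mode, no 8-bit mode on CDNA3.
inline bool supports_bits(Family family, int bits) {
  if (family == Family::Unknown) return false;
  switch (bits) {
    case 16: return true;
    case 8: return family != Family::CDNA3;
    default: return false;  // includes 4-bit
  }
}

// Wavefront size: wave64 is required on CDNA; wave32 is assumed on RDNA3.
inline int wavefront_size(Family family) {
  switch (family) {
    case Family::RDNA3: return 32;
    case Family::CDNA1:
    case Family::CDNA2:
    case Family::CDNA3: return 64;
    default: return 0;  // unknown architecture
  }
}
```

A caller could, for instance, reject a configuration early when `supports_bits(family_from_arch(arch), bits)` is false, instead of failing later at kernel compilation time.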
Edited by Leon Oostrum
