Tensor-Core Correlator

    The Tensor-Core Correlator is a GPU library that exploits the tensor cores of modern NVIDIA GPUs to compute cross/auto correlations 5-10 times more efficiently than regular GPU cores. Its primary use is to combine the signals of (many) receivers of a radio telescope. The library can be used in any FX correlator, but is not a full correlator application: it only computes the correlations. The rest of the application should take care of I/O, filtering, etc. For more information, see the paper (John W. Romein, The Tensor-Core Correlator, Astronomy and Astrophysics, 656(A32), pages 1-4, December 2021).

    Brief overview of how to use the Tensor-Core Correlator library:

    Clone and build the repository (see below).

    Include libtcc/Correlator.h and link with libtcc/libtcc.so. Create a tcc::Correlator object with the number of receivers, channels, etc. as arguments; this automatically compiles the CUDA code (at runtime). Use the launchAsync method to correlate a block of samples; the samples must already be in device memory. The TCC adheres to RAII: any error results in the failure to create a tcc::Correlator object (and throws an explanatory exception). test/SimpleExample/SimpleExample.cu illustrates how the TCC library can be used.
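    To make the flow concrete, a minimal sketch of this usage pattern follows. The argument lists are placeholders, not the exact signatures; consult libtcc/Correlator.h and test/SimpleExample/SimpleExample.cu for the real interface.

    ```cpp
    #include <libtcc/Correlator.h>

    void correlate(/* ... device buffers for samples and visibilities ... */)
    {
      // Constructing the Correlator compiles the CUDA kernel at runtime.
      // Per RAII, a failure throws an explanatory exception rather than
      // returning a half-initialized object.
      tcc::Correlator correlator(/* device, input format, number of
                                    receivers, channels, etc. */);

      // The samples must already reside in device memory before this call.
      correlator.launchAsync(/* stream, visibilities, samples */);
    }
    ```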

    The TCC internally uses wrappers around the CUDA driver API (util/cu.h) and the NVRTC library (util/nvrtc.h). The rest of the correlator code can use these wrappers as well, use the CUDA driver API directly, use the CUDA runtime API, or use OpenCL. See test/SimpleExample/SimpleExample.cu for how to use the CUDA runtime API, test/CorrelatorTest/CorrelatorTest.cc for how to use the CUDA driver API (wrappers), and test/OpenCLCorrelatorTest/OpenCLCorrelatorTest.cc for how to use the TCC in an OpenCL program. test/CorrelatorTest/CorrelatorTest.cc is a much more versatile and robust (but more complex) example than test/SimpleExample/SimpleExample.cu.

    The TCC accepts the following input data types:

    • half precision floating point (a.k.a. fp16), starting from Volta (sm_70)
    • e4m3 and e5m2 (a.k.a. fp8), starting from Hopper (sm_90)
    • 8-bit integers (i8), starting from the Jetson Xavier (sm_72)
    • 4-bit integers (i4), only natively supported on Ampere and Ada
    #if INPUT_FORMAT == FORMAT_I4
    #define NR_TIMES_PER_BLOCK    32
    typedef complex_int4_t        Sample;
    typedef std::complex<int32_t> Visibility;
    #elif INPUT_FORMAT == FORMAT_I8
    #define NR_TIMES_PER_BLOCK    16
    typedef std::complex<int8_t>  Sample;
    typedef std::complex<int32_t> Visibility;
    #elif INPUT_FORMAT == FORMAT_E4M3
    #define NR_TIMES_PER_BLOCK    16
    typedef std::complex<__nv_fp8_e4m3>  Sample;
    typedef std::complex<float>   Visibility;
    #elif INPUT_FORMAT == FORMAT_E5M2
    #define NR_TIMES_PER_BLOCK    16
    typedef std::complex<__nv_fp8_e5m2>  Sample;
    typedef std::complex<float>   Visibility;
    #elif INPUT_FORMAT == FORMAT_FP16
    #define NR_TIMES_PER_BLOCK    8
    typedef std::complex<__half>  Sample;
    typedef std::complex<float>   Visibility;
    #endif
    
    
    typedef Sample Samples[NR_CHANNELS][NR_SAMPLES_PER_CHANNEL / NR_TIMES_PER_BLOCK][NR_RECEIVERS][NR_POLARIZATIONS][NR_TIMES_PER_BLOCK];
    typedef Visibility Visibilities[NR_CHANNELS][NR_BASELINES][NR_POLARIZATIONS][NR_POLARIZATIONS];
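    As a host-side illustration of this layout (with small, made-up dimensions; NR_BASELINES is assumed here to be the usual NR_RECEIVERS * (NR_RECEIVERS + 1) / 2 count of auto- plus cross-correlations), the sketch below computes the flat offset of sample (channel, time, receiver, polarization): the time index is split into an outer block index, and the NR_TIMES_PER_BLOCK samples within a block form the fastest-varying dimension.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Small illustrative dimensions; real values are chosen by the application.
    constexpr std::size_t NR_CHANNELS            = 4;
    constexpr std::size_t NR_SAMPLES_PER_CHANNEL = 64;
    constexpr std::size_t NR_RECEIVERS           = 8;
    constexpr std::size_t NR_POLARIZATIONS       = 2;
    constexpr std::size_t NR_TIMES_PER_BLOCK     = 8;   // the FP16 case

    // Auto- plus cross-correlations (assumed definition of NR_BASELINES).
    constexpr std::size_t NR_BASELINES = NR_RECEIVERS * (NR_RECEIVERS + 1) / 2;

    // Flat offset of sample (ch, t, rcv, pol) in the Samples layout: the time
    // index splits into an outer block index (t / NR_TIMES_PER_BLOCK) and an
    // inner offset (t % NR_TIMES_PER_BLOCK), which varies fastest.
    constexpr std::size_t sampleOffset(std::size_t ch, std::size_t t,
                                       std::size_t rcv, std::size_t pol)
    {
      return (((ch * (NR_SAMPLES_PER_CHANNEL / NR_TIMES_PER_BLOCK)
                + t / NR_TIMES_PER_BLOCK)
               * NR_RECEIVERS + rcv)
              * NR_POLARIZATIONS + pol)
             * NR_TIMES_PER_BLOCK + t % NR_TIMES_PER_BLOCK;
    }
    ```

    Consecutive time samples of one receiver/polarization are thus contiguous in groups of NR_TIMES_PER_BLOCK, which is the granularity the tensor cores consume.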

    Note that with FORMAT_I4 and FORMAT_I8, the input samples must not contain the values -8 and -128, respectively, as these values cannot be conjugated properly (their negation is not representable). The input data type (Samples) has an unusual memory layout, but this seemed to be the only format that yields good performance (tensor cores are very unforgiving).
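    To see why, the sketch below models 4-bit two's-complement negation (the i8 case is analogous with -128). Conjugation negates the imaginary part, but +8 does not fit in the 4-bit range [-8, 7], so negating -8 wraps back to -8 and the conjugate comes out wrong.

    ```cpp
    #include <cassert>

    // Negate a value as 4-bit two's-complement hardware would: compute -v,
    // then wrap the result into the representable range [-8, 7].
    constexpr int negate4(int v)
    {
      return (((-v) + 8) & 0xF) - 8;
    }
    ```

    negate4(-8) yields -8 instead of the mathematically correct +8, which is why -8 must be excluded from i4 input samples.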

    Limitations:

    • NR_POLARIZATIONS must be 2
    • the number of samples to integrate over must be a multiple of NR_TIMES_PER_BLOCK.

    Building, testing, and installation

    Clone the repository:

    git clone https://git.astron.nl/RD/tensor-core-correlator.git

    To build and install the project, run:

    cmake -S . -B build
    make -C build
    make -C build install

    To install in a custom location, e.g. ~/.local, run:

    cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local
    make -C build
    make -C build install

    To compile and run the tests, run:

    cmake -S . -B build -DBUILD_TESTING=ON
    make -C build
    make -C build test

    The tests require a GPU. On the DAS-6/ASTRON cluster you can request a GPU node and run the tests with the command:

    srun -N 1 --gres=gpu:A4000:1 make -C build test

    Note that the command above requests a node with an NVIDIA A4000 GPU, because the tests require a GPU that has tensor cores.

    Example usage

    The example subdirectory contains a minimal example that demonstrates how this library can be integrated into another project. The example assumes that you have pre-installed both this library (libtcc) and cudawrappers. E.g., when libtcc is installed in <prefix>/libtcc and cudawrappers in <prefix>/cudawrappers, you can build the example by running:

    cmake . -DCMAKE_PREFIX_PATH="<prefix>/cudawrappers;<prefix>/libtcc"
    make
    ./example

    Bugs/feedback

    Contact John Romein (romein@astron.nl) to report bugs or give feedback.