    Tensor-Core Correlator

    The Tensor-Core Correlator is a GPU library that exploits the tensor cores of modern NVIDIA GPUs to compute cross/auto correlations 5-10 times more efficiently than regular GPU cores. Its primary use is to combine the signals of (many) receivers of a radio telescope. The library can be used in any FX correlator, but is not a full correlator application: it only computes the correlations. The rest of the application should take care of I/O, filtering, etc. For more information, see the paper (John W. Romein, The Tensor-Core Correlator, Astronomy and Astrophysics, 656(A32), pages 1-4, December 2021).

    A brief overview of how to use the Tensor-Core Correlator library:

    Clone and build the repository; see below.

    Include libtcc/Correlator.h and link with libtcc/libtcc.so. Create a tcc::Correlator object with the number of receivers, channels, etc. as arguments; this automatically compiles the CUDA code at run time. Use the launchAsync method to correlate a block of samples; make sure that the sample data is already in device memory. The TCC adheres to RAII: any error results in the failure to create a tcc::Correlator object and an explanatory exception being thrown. test/SimpleExample/SimpleExample.cu illustrates how the TCC library can be used; a minimal sketch follows below.
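    The calling sequence roughly looks as follows. This is only a sketch: the constructor parameters, their order, and the launchAsync argument order are assumptions here, and the problem size (8-bit samples, 64 receivers, 480 channels, 3072 samples per channel) is made up; consult libtcc/Correlator.h and test/SimpleExample/SimpleExample.cu for the actual signatures.

    #include <libtcc/Correlator.h>

    #include <cuda.h>
    #include <cuda_runtime.h>

    #include <cstdint>
    #include <exception>
    #include <iostream>

    int main()
    {
      try {
        cudaSetDevice(0);   // make sure a CUDA context exists before the
        cudaFree(nullptr);  // correlator compiles its kernel at run time

        // Hypothetical problem size; construction throws on any error (RAII).
        tcc::Correlator correlator(/* nrBits */ 8,
                                   /* nrReceivers */ 64,
                                   /* nrChannels */ 480,
                                   /* nrSamplesPerChannel */ 3072,
                                   /* nrPolarizations */ 2);

        // The samples must already be in device memory; the visibilities are
        // written to device memory as well. The sizes follow the typedefs below.
        void *deviceSamples, *deviceVisibilities;
        cudaMalloc(&deviceSamples, 480UL * 3072 * 64 * 2 * 2 * sizeof(int8_t));
        cudaMalloc(&deviceVisibilities, 480UL * (64 * 65 / 2) * 2 * 2 * 2 * sizeof(int32_t));

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        correlator.launchAsync((CUstream) stream,
                               (CUdeviceptr) deviceVisibilities,
                               (CUdeviceptr) deviceSamples);
        cudaStreamSynchronize(stream);

        cudaFree(deviceVisibilities);
        cudaFree(deviceSamples);
        cudaStreamDestroy(stream);
      } catch (std::exception &error) {
        std::cerr << error.what() << std::endl;
      }
    }

    In a real application the samples would typically be copied to the device with cudaMemcpyAsync on the same stream before launchAsync, and the visibilities copied back afterwards.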

    The TCC internally uses wrappers around the CUDA driver API (util/cu.h) and the NVRTC library (util/nvrtc.h). The rest of the correlator code can use these wrappers as well, use the CUDA driver API directly, use the CUDA runtime API, or use OpenCL. See test/SimpleExample/SimpleExample.cu for how to use the CUDA runtime API, test/CorrelatorTest/CorrelatorTest.cc for how to use the CUDA driver API (wrappers), and test/OpenCLCorrelatorTest/OpenCLCorrelatorTest.cc for how to use the TCC in an OpenCL program. test/CorrelatorTest/CorrelatorTest.cc is a much more versatile, robust (and complex) example than test/SimpleExample/SimpleExample.cu.

    Input and output data types are defined as follows:

    #if NR_BITS == 4
    typedef complex_int4_t        Sample;
    typedef std::complex<int32_t> Visibility;
    #elif NR_BITS == 8
    typedef std::complex<int8_t>  Sample;
    typedef std::complex<int32_t> Visibility;
    #elif NR_BITS == 16
    typedef std::complex<__half>  Sample;
    typedef std::complex<float>   Visibility;
    #endif
    
    #define NR_TIMES_PER_BLOCK (128 / NR_BITS)
    
    typedef Sample Samples[NR_CHANNELS][NR_SAMPLES_PER_CHANNEL / NR_TIMES_PER_BLOCK][NR_RECEIVERS][NR_POLARIZATIONS][NR_TIMES_PER_BLOCK];
    typedef Visibility Visibilities[NR_CHANNELS][NR_BASELINES][NR_POLARIZATIONS][NR_POLARIZATIONS];

    Note that in 4-bit and 8-bit mode, the input samples may not contain -8 or -128 respectively, as these values cannot be conjugated properly. The input data type (Samples) is a weird format, but this seemed to be the only format that yields good performance (tensor cores are very unforgiving).
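    To make this layout concrete, the sketch below shows how an individual sample and visibility would be addressed on the host (8-bit mode, example sizes). The triangular baseline numbering, with autocorrelations included, is an assumption here; verify it against test/CorrelatorTest/CorrelatorTest.cc.

    #include <complex>
    #include <cstdint>

    #define NR_BITS                 8
    #define NR_CHANNELS             480
    #define NR_SAMPLES_PER_CHANNEL  3072
    #define NR_RECEIVERS            64
    #define NR_POLARIZATIONS        2
    #define NR_BASELINES            (NR_RECEIVERS * (NR_RECEIVERS + 1) / 2)  // incl. autocorrelations
    #define NR_TIMES_PER_BLOCK      (128 / NR_BITS)

    typedef std::complex<int8_t>  Sample;
    typedef std::complex<int32_t> Visibility;

    typedef Sample Samples[NR_CHANNELS][NR_SAMPLES_PER_CHANNEL / NR_TIMES_PER_BLOCK][NR_RECEIVERS][NR_POLARIZATIONS][NR_TIMES_PER_BLOCK];
    typedef Visibility Visibilities[NR_CHANNELS][NR_BASELINES][NR_POLARIZATIONS][NR_POLARIZATIONS];

    // The time axis is split into a coarse block index and a position within
    // a block of 128 / NR_BITS samples.
    inline Sample &sampleAt(Samples &samples, unsigned channel, unsigned time,
                            unsigned receiver, unsigned polarization)
    {
      return samples[channel][time / NR_TIMES_PER_BLOCK][receiver][polarization][time % NR_TIMES_PER_BLOCK];
    }

    // Assumed baseline numbering: baseline(r1, r2) with r2 <= r1.
    inline Visibility &visibilityAt(Visibilities &visibilities, unsigned channel,
                                    unsigned receiver1, unsigned receiver2,
                                    unsigned polarization1, unsigned polarization2)
    {
      unsigned baseline = receiver1 * (receiver1 + 1) / 2 + receiver2;
      return visibilities[channel][baseline][polarization1][polarization2];
    }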

    Limitations:

    • NR_POLARIZATIONS must be 2
    • NR_BITS must be 4, 8, or 16
    • NR_SAMPLES_PER_CHANNEL (the number of samples over which is integrated) must be a multiple of 128 / NR_BITS, i.e., 32, 16, or 8 for 4-bit, 8-bit, or 16-bit input, respectively (see the sanity-check sketch below)
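    These constraints can be verified up front on the host. The helper below is illustrative only; its name and error handling are not part of libtcc.

    #include <stdexcept>

    void checkCorrelatorParameters(unsigned nrBits, unsigned nrPolarizations,
                                   unsigned nrSamplesPerChannel)
    {
      if (nrPolarizations != 2)
        throw std::invalid_argument("NR_POLARIZATIONS must be 2");
      if (nrBits != 4 && nrBits != 8 && nrBits != 16)
        throw std::invalid_argument("NR_BITS must be 4, 8, or 16");
      if (nrSamplesPerChannel % (128 / nrBits) != 0)
        throw std::invalid_argument("NR_SAMPLES_PER_CHANNEL must be a multiple of 128 / NR_BITS");
    }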

    Building, testing, and installation

    Clone the repository:

    git clone https://git.astron.nl/RD/tensor-core-correlator.git

    To build and install the project, run:

    cmake -S . -B build
    make -C build
    make -C build install

    To install in a custom location, e.g. ~/.local, run:

    cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local
    make -C build
    make -C build install

    To compile and run the tests, run:

    cmake -S . -B build -DBUILD_TESTING=ON
    make -C build
    make -C build test

    The tests require a GPU. On the DAS-6/ASTRON cluster you can request a GPU node and run the tests with the command:

    srun -N 1 --gres=gpu:A4000:1 make -C build test

    Note that the command above requests a node with an NVIDIA A4000 GPU, because the tests require a GPU with tensor cores.

    Example usage

    The example subdirectory contains a minimal example that demonstrates how this library can be integrated into another project. The example assumes that you have pre-installed both this library (libtcc) and cudawrappers. For example, when libtcc is installed in <prefix>/tcc and cudawrappers is installed in <prefix>/cudawrappers, you can build the example by running:

    cmake . -DCMAKE_PREFIX_PATH="<prefix>/cudawrappers;<prefix>/tcc"
    make
    ./example

    Bugs/feedback

    Contact John Romein (romein@astron.nl) to report bugs or provide feedback.