From a5ec00d95f7d6187056d15ed61d7c80e59e42546 Mon Sep 17 00:00:00 2001
From: John Romein <romein@astron.nl>
Date: Mon, 30 Aug 2021 13:21:14 +0000
Subject: [PATCH] Update README.md

---
 README.md | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/README.md b/README.md
index 4e8cc0b..c8d2261 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,55 @@
 # Tensor-Core Correlator
 
+## Brief overview of how to use the Tensor-Core Correlator library
+
+Build the library (just type `make`).
+
+Include `libtcc/Correlator.h`, and link with `libtcc/libtcc.so`.
+Create a `tcc::Correlator` object with the number of receivers, channels, etc.
+as arguments; this automatically compiles the CUDA code (at runtime).
+Use the `launchAsync` method to correlate a block of samples; you must make
+sure that the sample data is already in device memory.
+The TCC adheres to RAII: any error results in the failure to create
+a `tcc::Correlator` object (and throws an explanatory exception).
+`test/SimpleExample/SimpleExample.cu` illustrates how the TCC library can be used.
+
+The TCC internally uses wrappers around the CUDA driver API (`util/cu.h`) and
+the NVRTC library (`util/nvrtc.h`). The rest of the correlator code can use
+these wrappers as well, use the CUDA driver API directly, use the CUDA
+runtime API, or use the OpenCL environment. See `test/SimpleExample/SimpleExample.cu`
+on how to use the CUDA runtime API; `test/CorrelatorTest/CorrelatorTest.cc` on
+how to use the CUDA driver API (wrappers); and
+`test/OpenCLCorrelatorTest/OpenCLCorrelatorTest.cc` on how to use TCC in an
+OpenCL program. `test/CorrelatorTest/CorrelatorTest.cc` is a much more versatile,
+robust (and complex) example than `test/SimpleExample/SimpleExample.cu`.
+
+Input and output data types are defined as follows:
+
+```
+#if NR_BITS == 4
+typedef complex_int4_t Sample;
+typedef std::complex<int32_t> Visibility;
+#elif NR_BITS == 8
+typedef std::complex<int8_t> Sample;
+typedef std::complex<int32_t> Visibility;
+#elif NR_BITS == 16
+typedef std::complex<__half> Sample;
+typedef std::complex<float> Visibility;
+#endif
+
+#define NR_TIMES_PER_BLOCK (128 / NR_BITS)
+
+typedef Sample Samples[NR_CHANNELS][NR_SAMPLES_PER_CHANNEL / NR_TIMES_PER_BLOCK][NR_RECEIVERS][NR_POLARIZATIONS][NR_TIMES_PER_BLOCK];
+typedef Visibility Visibilities[NR_CHANNELS][NR_BASELINES][NR_POLARIZATIONS][NR_POLARIZATIONS];
+```
+
+The input data type (`Samples`) has an unusual layout (time is split into blocks of `NR_TIMES_PER_BLOCK`
+samples), but this seemed to be the only layout that yields good performance (tensor cores are very unforgiving).
+
+Limitations:
+- `NR_POLARIZATIONS` must be 2
+- `NR_BITS` must be 4, 8, or 16
+- `NR_SAMPLES_PER_CHANNEL` (the number of samples over which is integrated) must be a multiple of 128 / `NR_BITS`
+  (i.e., 32, 16, or 8 for 4-bit, 8-bit, or 16-bit input, respectively).
+
+Contact John Romein (romein@astron.nl) to report bugs or give feedback.
-- 
GitLab
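The blocked `Samples` layout and the limitations listed in the README text can be sketched as a flat index computation. This is only an illustration under stated assumptions: `SampleLayout` and its members are hypothetical names, not part of libtcc, and the macros from the typedefs are replaced by runtime values so the arithmetic can be checked on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not part of libtcc): mirrors the dimensions of the
// `Samples` typedef using runtime values instead of preprocessor macros.
struct SampleLayout {
  std::size_t nrBits, nrChannels, nrSamplesPerChannel, nrReceivers, nrPolarizations;

  SampleLayout(std::size_t bits, std::size_t channels, std::size_t samplesPerChannel,
               std::size_t receivers, std::size_t polarizations)
      : nrBits(bits), nrChannels(channels), nrSamplesPerChannel(samplesPerChannel),
        nrReceivers(receivers), nrPolarizations(polarizations) {
    // The checks below restate the documented limitations.
    if (bits != 4 && bits != 8 && bits != 16)
      throw std::invalid_argument("NR_BITS must be 4, 8, or 16");
    if (polarizations != 2)
      throw std::invalid_argument("NR_POLARIZATIONS must be 2");
    if (samplesPerChannel % (128 / bits) != 0)
      throw std::invalid_argument("NR_SAMPLES_PER_CHANNEL must be a multiple of 128 / NR_BITS");
  }

  // NR_TIMES_PER_BLOCK: 32, 16, or 8 for 4-bit, 8-bit, or 16-bit input.
  std::size_t nrTimesPerBlock() const { return 128 / nrBits; }

  // Total number of Sample elements in one Samples array.
  std::size_t size() const {
    return nrChannels * nrSamplesPerChannel * nrReceivers * nrPolarizations;
  }

  // Flat element index of Samples[channel][time / T][receiver][polarization][time % T],
  // where T = NR_TIMES_PER_BLOCK. The time within a block is the innermost dimension.
  std::size_t index(std::size_t channel, std::size_t time,
                    std::size_t receiver, std::size_t polarization) const {
    const std::size_t t = nrTimesPerBlock();
    const std::size_t block = time / t, within = time % t;
    return (((channel * (nrSamplesPerChannel / t) + block) * nrReceivers + receiver)
                * nrPolarizations + polarization) * t + within;
  }
};
```

For example, with 8-bit input, 2 channels, 32 samples per channel, 3 receivers and 2 polarizations, consecutive times 0 and 1 of the same receiver and polarization are adjacent in memory, while time 16 starts a new block 96 elements further on.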