Repository graph

Branches:
  • cmake-build
  • master (default)
  • amd-support
  • experimental-bulk-copies-backup
  • fix-opencl-test

Commits shown in the graph:
  • Register OpenCLCorrelatorTest with CTest (fix-opencl-test)
  • Fixed OpenCL test program.
  • Merge branch 'experimental-bulk-copies' into 'master' (master)
  • Removed code in comments
  • Removed debug code.
  • Removed debug code.
  • Fixed approximates(), so that it now also catches sign errors.
  • Fixed fp8 conjugation bug.
  • Merge branch 'experimental-bulk-copies' of https://git.astron.nl/RD/tensor-core-correlator into experimental-bulk-copies
  • Added [[deprecated]] backward-compatible Correlator() constructor.
  • Removed white space.
  • Updated README.md for added fp4 support.
  • Removed pre-async-copies-era code.
  • No longer necessary to use an old cuda wrappers version.
  • More extensive and faster benchmarking.
  • Allow setting the number of thread blocks per SM.
  • Use std::optional<>. API change!
  • Print #registers (commented out).
  • Run fewer thread blocks / SM if not using async copies (due to register use).
  • Fixed default nrReceiversPerBlock,
  • Simplification.
  • Minot i4 optimization.
  • Changes int8 shape from m16n16k16 to m16n8k32.
  • Run ptxas only once.
  • Removed (broken) PORTABLE support.
  • Fixed Volta.
  • Fix !defined ASYNC_COPIES.
  • Fixed CP_ASYNC_BULK, but still disabled; it is slower than ASYNC_COPIES.
  • Simplified copyAsync().
  • Use CUDA_VERSION.
  • Sped up C++ verification code.
  • Added fp4 to benchmark.
  • Adapted usage() for fp4.
  • Fixes builds with CUDA 12.7 and older.
  • Added fp4 support.
  • API change: export nrTimesPerBlock.
  • Fixed i4.
  • Fixed fp8 conj_perm()
  • Slightly faster implemention of conj_perm on GH200.
  • Also use int2 loads for fp16 (matrix entries changed in K order)