
Repository graph

Branches:
  • cmake-build
  • master default
  • amd-support
  • experimental-bulk-copies
Commits, in the order shown in the graph (branch labels in brackets):
  • Fixed fp8 conj_perm() [experimental-bulk-copies]
  • Run ptx assmbler right after compilation. [master]
  • Temporarily revert to cudawrappers 0.9.0.
  • Merge branch 'use-cudawrappers-0.9.0' into 'master'
  • Use cudawrappers 0.9.0
  • Slightly faster implemention of conj_perm on GH200.
  • Also use int2 loads for fp16 (matrix entries changed in K order)
  • Reordered A and B matrix along K axis to optimize memory accesses.
  • Fixed e4m3/e5m2 support. Matrix A instead of B fixes the complex numbers.
  • Removed some implicit assumptions that NR_RECEIVERS_PER_TCM_Y equals 8.
  • More efficient B matrix ordering for fp16 and i8; breaks fp8 and i4.
  • Added i8.
  • Merge branch 'fp8' into 'master'
  • Fp8
  • Initial experiments. Works only for e4m3 and i4.
  • Added e4m3 benchmark.
  • Fixed module environment.
  • Updated for e4m3/e5m2 support.
  • Adapted tests to the new input format argument.
  • Adapted "usage" line.
  • Added e5m2 support.
  • Added e4m3 support.
  • Added support for sm_120 (consumer-grade Blackwell)
  • Provide enough parallelism for benchmarking.
  • Merge branch 'fix-pmt' into 'master'
  • Update to new PMT::Create interface
  • storeVisibility() now has recvX, recvY as arguments, instead of baseline number.
  • Add HIP launch bounds [amd-support]
  • Remove temporary syncthreads
  • Implement AMD visibility store with warp shuffle such that one thread can store full visilibities. Code to be simplified/cleaned up
  • Implement direct write of visibilities from registers on AMD GPUs
  • Revert "Cleanup"
  • Merge branch 'amd-support' of git.astron.nl:RD/tensor-core-correlator into amd-support
  • Cleanup
  • Fix PMT ROCm support
  • AMD arch check only on device compile
  • Add test with more channels
  • Cleaner fix for wave64
  • Make correlator tests fail when verification fails
  • Fix threadIdx-based offset calculation for wave64 in 8/16bit mode