idg-bin/tests/python/cuda-unified.py seg faults during FFT

cuda-generic.py runs fine. Python 3.8.5, CUDA 11.2. Bram was able to reproduce it last Friday.
$ python cuda-unified.py 
Error importing OpenCL: ('/opt/lib/libidg-opencl.so: cannot open shared object file: No such file or directory',)
>> Dataset full: 
number of stations: 52
number of baselines: 1326
longest baseline = 1980.4 km
maximum grid size: 3772986368
longest baseline required: 29.12 km
>> Dataset limited to baseline up to 29.12 km: 
number of stations: 52
number of baselines: 530
longest baseline = 28.6078 km
>> Dataset limited to 190 baselines: 
number of stations: 52
number of baselines: 190
longest baseline = 28.6078 km
CUDA::default_info
Searching for source files in: /opt/lib/idg-cuda
Temporary files will be stored in: /tmp/idg-0mn9Wf
CUDA::CUDA
InstanceCUDA
set_parameters
compile_kernels
Searching for source files in: /opt/lib/idg-cuda
Temporary files will be stored in: /tmp/idg-gTmQyo
Compiling /tmp/idg-0mn9Wf/Splitter.cubin
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -DTILE_SIZE_GRID=128 -o /tmp/idg-0mn9Wf/Splitter.cubin /opt/lib/idg-cuda/KernelSplitter.cu
Compiling /tmp/idg-0mn9Wf/Calibrate.cubin
Compiling /tmp/idg-0mn9Wf/KernelFFTShift.cubin
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -o /tmp/idg-0mn9Wf/KernelFFTShift.cubin /opt/lib/idg-cuda/KernelFFTShift.cu
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -o /tmp/idg-0mn9Wf/Calibrate.cubin /opt/lib/idg-cuda/KernelCalibrate.cu
Compiling /tmp/idg-0mn9Wf/AverageBeam.cubin
Compiling /tmp/idg-0mn9Wf/Scaler.cubin
Compiling /tmp/idg-0mn9Wf/Gridder.cubin/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -o /tmp/idg-0mn9Wf/Scaler.cubin /opt/lib/idg-cuda/KernelScaler.cu
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -o /tmp/idg-0mn9Wf/AverageBeam.cubin /opt/lib/idg-cuda/KernelAverageBeam.cu

Compiling /tmp/idg-0mn9Wf/Adder.cubin
Compiling /usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -DTILE_SIZE_GRID=128 -o /tmp/idg-0mn9Wf/Adder.cubin /opt/lib/idg-cuda/KernelAdder.cu
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -DBATCH_SIZE=128 -o /tmp/idg-0mn9Wf/Gridder.cubin /opt/lib/idg-cuda/KernelGridder.cu
/tmp/idg-0mn9Wf/KernelWtiling.cubin
Compiling /tmp/idg-0mn9Wf/Degridder.cubin
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -DBATCH_SIZE=256 -o /tmp/idg-0mn9Wf/Degridder.cubin /opt/lib/idg-cuda/KernelDegridder.cu
/usr/local/cuda/bin/nvcc -cubin  -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include -o /tmp/idg-0mn9Wf/KernelWtiling.cubin /opt/lib/idg-cuda/KernelWtiling.cu
CUDA::initialize_buffers
CUDA::free_buffers
Devices: 
GeForce RTX 3090
	Device memory : 12241 Mb  / 24268 Mb (free / total)
	Shared memory : 48.00 Kb
	Clk frequency : 1695 Ghz
	Mem frequency : 9751 Ghz
	Number of SM  : 82
	Mem bus width : 384 bit
	Mem bandwidth : 936 GB/s
	Number of threads  : 1536
	Capability    : 86
	Unified memory : 1


Compiler flags: 
 -use_fast_math  -G -src-in-ptx -arch=sm_86 -DNR_POLARIZATIONS=4 -I/opt/include

Generic::Generic
Unified::Unified
nr_stations           =  20
nr_baselines          =  190
nr_channels           =  1
nr_timesteps          =  7200
nr_timeslots          =  16
nr_correlations       =  4
subgrid_size          =  24
grid_size             =  2048
image_size            =  0.0592
kernel_size           =  13
integration_time      =  0.9
Plan::Plan
Plan::initialize
kernel_size  : 13
subgrid_size : 24
grid_size    : 2048
nr_baselines    : 190 (input)
nr_timesteps    : 7200 (per baseline)
nr_channels     : 1 (per baseline)
nr_visibilities : 1368000 (planned)
nr_subgrids     : 1827 (planned)
Unified::do_gridding
### Initialize gridding
CUDA::initialize
CUDA::compute_jobsize
CUDA::cleanup
CUDA::initialize_buffers
CUDA::free_buffers
nr_stations  = 20
nr_timeslots = 16
nr_timesteps = 7200
nr_channels  = 1
subgrid_size = 24
nr_baselines = 190
max_jobsize  = 0
Bytes required for static data: 11446276
Bytes required for job data: 1742880
Bytes free: 12836208640
Bytes reserved: 5134483456
Jobsize: 190
### Run gridding
Generic::run_gridding
CUDA::do_transform
Segmentation fault (core dumped)
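
The log stops inside `CUDA::do_transform`, but it does not show which Python call led there. One way to capture that, as a minimal sketch: Python's standard `faulthandler` module can dump the Python-level traceback when the process receives SIGSEGV. The helper name `run_with_faulthandler` and the log file argument are hypothetical and not part of IDG.

```python
import faulthandler
import runpy


def run_with_faulthandler(script_path, log_file):
    """Run a script with faulthandler enabled.

    If the process segfaults, the Python traceback of all threads is
    written to log_file. log_file must be an open file object with a
    real file descriptor, and it must stay open while the script runs.
    """
    faulthandler.enable(file=log_file, all_threads=True)
    # Execute the script as if it had been started with `python script_path`.
    return runpy.run_path(script_path, run_name="__main__")
```

For a quick check without a wrapper, `python -X faulthandler cuda-unified.py` enables the same handler and writes the traceback to stderr.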