"Device or resource busy" with Toil and Slurm
To make better use of cluster resources I'm trying to run LINC with Toil, but I'm running into a snag. It mostly works: jobs are submitted to the Slurm queue and the first calibrate step runs, but it crashes at the first applycal of the PA solutions. The error I get is:
WARNING: SINGULARITY_BINDPATH and APPTAINER_BINDPATH have different values, using the latter
WARNING: SINGULARITYENV_TMPDIR and APPTAINERENV_TMPDIR have different values, using the latter
WARNING: Skipping environment variable [SINGULARITYENV_TMPDIR=/tmp], TMPDIR is already overridden with different value [/cosma8/data/do011/dc-swei1/tmp.8wJZyv9ssH/tmpdir_LINC_calibrator/]
terminate called after throwing an instance of 'casacore::AipsError'
what(): RegularFile::move error on /MVptMR/out_L726576_SB002_uv.MS/table.dat_tmp to /MVptMR/out_L726576_SB002_uv.MS/table.dat: Device or resource busy
with the accompanying DP3 output:
WARNING - software has been build with -march=znver2 but current machine reports -march=sapphirerapids.
If you encounter strange behaviour or Illegal instruction warnings, consider building a container with the appropriate architecture set.
WARNING - software has been build with -mtune=znver2 but current machine -mtune=sapphirerapids.
If you encounter strange behaviour or Illegal instruction warnings, consider building a container with the appropriate architecture set.
MSReader
input MS: /MVptMR/out_L726576_SB002_uv.MS
band 0
startchan: 0 (0)
nchan: 4 (0)
ncorrelations: 4
nbaselines: 2775
first time: 2019/07/12/14:00:02
last time: 2019/07/12/14:09:59
ntimes: 150
time interval: 4.00556
DATA column: DATA
WEIGHT column: WEIGHT_SPECTRUM
FLAG column: FLAG
autoweight: false
ApplyCal applycal.
H5Parm: /var/lib/cwl/stge175b851-7861-4d3d-a674-f8b1b494d07b/cal_solutions.h5
SolSet: calibrator
SolTab: polalign
Direction: 0
Interpolation: nearest
Missing antennas: error
Correction: diagonal phase
Update weights: false
Invert: true
SigmaMMSE: 0
TimeSlotsPerParmUpdate: 200
Counter count.
MSUpdater msout.
MS: /MVptMR/out_L726576_SB002_uv.MS
datacolumn: CORRECTED_DATA (has been added to the MS)
flagcolumn: FLAG
weightcolumn WEIGHT_SPECTRUM
writing: data flags
Compressed: no
flush: 0
*** WARNING: the following parset keywords were not used ***
maybe they are misspelled
msout.storagemanager.databitrate
Processing 150 time slots ...
0%....10....20....30....40....50....60....70....80....90....100%
Finishing processing ...
NaN/infinite data flagged in reader
===================================
Percentage of flagged visibilities detected per correlation:
[753201,0,0,0] out of 1665000 visibilities [45%, 0%, 0%, 0%]
0 missing time slots were inserted
Flags set by OneApplyCal applycal.
<snip>
Has anyone managed to run LINC successfully with Toil yet? I'm posting here because I'm not sure why this step specifically crashes, or whether the cause is my environment/container/setup or the workflow itself. I'm hoping other eyes might have an idea of where to look next for debugging.
In case it's useful, my setup can be found here: https://github.com/tikk3r/flocs/blob/toil-runners/runners/run_LINC_calibrator_HBA_toil.sh
I'm using Toil 6.0.0.
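One check that might help narrow this down is a minimal sketch like the one below. It assumes the failure comes from a plain rename of `table.dat_tmp` onto an existing `table.dat`, which is only a guess based on the error message. The script creates two small files in a scratch subdirectory of a given directory and renames one over the other, reporting the errno name if that fails. The function name `check_rename` and the file names are purely illustrative. The directory to test (for example, the one bind-mounted into the container) has to be passed in by hand.

```python
import errno
import os
import sys
import tempfile


def check_rename(directory):
    """Rename a temp file over an existing file inside `directory`.

    Returns None if the rename succeeds, otherwise the errno name
    (for example "EBUSY") reported by the operating system.
    """
    # Work in a fresh subdirectory so nothing existing is touched.
    workdir = tempfile.mkdtemp(prefix="rename_check_", dir=directory)
    target = os.path.join(workdir, "table.dat")
    temp = target + "_tmp"
    try:
        # Create both files, mimicking table.dat and table.dat_tmp.
        for path in (target, temp):
            with open(path, "w") as handle:
                handle.write("x")
        try:
            os.rename(temp, target)
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return None
    finally:
        # Clean up whatever is left, whether or not the rename worked.
        for path in (temp, target):
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(workdir)


if __name__ == "__main__":
    # Assumption: the directory to test is given as the first argument;
    # otherwise the system temp directory is used.
    directory = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    result = check_rename(directory)
    status = "OK" if result is None else "failed with " + result
    print(directory + ": rename " + status)
```

Running this inside the container against the directory that holds the MS, and again on the host against the same path, would show whether a simple rename already fails there or whether the problem only appears when DP3 itself updates the table.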