Reduce memory in predict step
This MR changes how the predict step parallelizes over sources. Previously this was done with a nested (dynamic) for loop, and memory was allocated for every possible thread index before entering that loop. If the prediction itself ran inside a RecursiveFor, every nested for loop would allocate n_threads buffers, leading to an n_threads^2 memory requirement.
This MR changes the source parallelization to a StaticFor, which divides the tasks into subranges. Initialization and allocation are now done before each subrange is processed. Inside a RecursiveFor, fewer subranges run at the same time, so overall at most n_threads subranges are running simultaneously, which fixes the issue.
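The per-subrange allocation pattern can be sketched roughly as follows. This is a minimal standalone illustration, not the project's StaticFor: `SplitIntoSubranges`, `RunStatic` and `SumSourceFluxes` are hypothetical names, and a plain sum over made-up flux values stands in for the actual prediction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not the project's StaticFor): splits [0, n_tasks) into
// at most n_threads contiguous subranges of near-equal size.
std::vector<std::pair<std::size_t, std::size_t>> SplitIntoSubranges(
    std::size_t n_tasks, std::size_t n_threads) {
  assert(n_threads > 0);
  const std::size_t n_ranges = std::min(n_tasks, n_threads);
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  ranges.reserve(n_ranges);
  for (std::size_t i = 0; i < n_ranges; ++i) {
    // Integer arithmetic spreads the remainder over the subranges.
    ranges.emplace_back(n_tasks * i / n_ranges, n_tasks * (i + 1) / n_ranges);
  }
  return ranges;
}

// Hypothetical helper: runs body(start, end) on one thread per subrange.
void RunStatic(std::size_t n_tasks, std::size_t n_threads,
               const std::function<void(std::size_t, std::size_t)>& body) {
  std::vector<std::thread> workers;
  for (const auto& [start, end] : SplitIntoSubranges(n_tasks, n_threads)) {
    workers.emplace_back(body, start, end);
  }
  for (std::thread& worker : workers) worker.join();
}

// Stand-in for the prediction: the scratch buffer is created inside the body,
// so memory is only held by subranges that are actually running.
double SumSourceFluxes(const std::vector<double>& fluxes,
                       std::size_t n_threads) {
  std::mutex mutex;
  double total = 0.0;
  RunStatic(fluxes.size(), n_threads,
            [&](std::size_t start, std::size_t end) {
              // Allocated once per subrange, not once per thread index.
              std::vector<double> scratch(fluxes.begin() + start,
                                          fluxes.begin() + end);
              const double partial =
                  std::accumulate(scratch.begin(), scratch.end(), 0.0);
              std::lock_guard<std::mutex> lock(mutex);
              total += partial;
            });
  return total;
}
```

With 10 tasks and 4 threads this sketch produces the subranges [0,2), [2,5), [5,7) and [7,10), so four scratch buffers exist at most, regardless of how many thread indices an enclosing loop could hand out.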
This MR also fixes a performance issue: because parallelization was done in a 'dynamic for' fashion, all threads could be processing sources from the same patches at the same time. Every thread then had to apply the beam for those patches, leading to many more beam evaluations than necessary. Dividing the total number of sources into subranges should reduce this considerably.
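The effect on beam evaluations can be illustrated with a toy count. The sketch below makes several assumptions that are not taken from the code: sources are stored grouped by patch, a worker evaluates the beam once for every distinct patch it touches, and a round-robin assignment stands in for the worst-case interleaving that dynamic scheduling can produce. All function names are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Counts beam evaluations, assuming one evaluation per distinct
// (worker, patch) combination.
std::size_t CountBeamEvaluations(
    const std::vector<int>& patch_of_source,
    const std::vector<std::size_t>& worker_of_source) {
  std::set<std::pair<std::size_t, int>> worker_patch_pairs;
  for (std::size_t i = 0; i < patch_of_source.size(); ++i) {
    worker_patch_pairs.emplace(worker_of_source[i], patch_of_source[i]);
  }
  return worker_patch_pairs.size();
}

// Round-robin assignment: a stand-in for worst-case dynamic scheduling,
// where neighbouring sources end up on different workers.
std::vector<std::size_t> AssignInterleaved(std::size_t n_sources,
                                           std::size_t n_workers) {
  std::vector<std::size_t> worker(n_sources);
  for (std::size_t i = 0; i < n_sources; ++i) worker[i] = i % n_workers;
  return worker;
}

// Static assignment: each worker gets one contiguous subrange of sources.
std::vector<std::size_t> AssignContiguous(std::size_t n_sources,
                                          std::size_t n_workers) {
  std::vector<std::size_t> worker(n_sources);
  for (std::size_t i = 0; i < n_sources; ++i) {
    worker[i] = i * n_workers / n_sources;
  }
  return worker;
}

// Example layout: n_patches patches with sources_per_patch sources each,
// stored patch by patch.
std::vector<int> MakePatchLayout(std::size_t n_patches,
                                 std::size_t sources_per_patch) {
  std::vector<int> patch_of_source;
  for (std::size_t p = 0; p < n_patches; ++p) {
    patch_of_source.insert(patch_of_source.end(), sources_per_patch,
                           static_cast<int>(p));
  }
  return patch_of_source;
}
```

For 4 patches of 4 sources each and 4 workers, the interleaved assignment gives 16 (worker, patch) combinations in this model, while the contiguous assignment gives 4.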