diff --git a/doc/papers/2011/europar/execution_times.jgr b/doc/papers/2011/europar/execution_times.jgr index dad7110bb083ad8657dc2cf0b142b7b00fd0c77f..bb1dd5592a7bfd090f6729c2de83a29b740fa563 100644 --- a/doc/papers/2011/europar/execution_times.jgr +++ b/doc/papers/2011/europar/execution_times.jgr @@ -21,8 +21,17 @@ awk -v N=1 '/->/ {a+=$3;b+=$5;c+=$7;d+=$9;e+=$11;f+=$13;n++; if(n==N) print "1 " *) newgraph - yaxis min 0 max 100 label : BG/P occupation (% CPU time) - xaxis min 0 max 7 hash 0 label : Case (see Table 1) + yaxis + min 0 + max 100 + size 2.5 + label : BG/P occupation (% CPU time) + xaxis + min 0 + max 7 + hash 0 + size 2.5 + label : Case (see Table 1) (* hash_labels rotate 22 vjt hjr hash_label at 1 : I, 16i, 4s, 543b @@ -89,5 +98,5 @@ newgraph label : 1st all-to-all exchange & \ input handling - legend top defaults hjl linelength 75 x 4 y 90 + legend top defaults hjl linelength 75 x 4 y 97 diff --git a/doc/papers/2011/europar/lofar.pdf b/doc/papers/2011/europar/lofar.pdf index b13b247cd166551943e45d9bf811b605b2fec5d7..784d90fd7671bb8d56ae5b06263c27f1e46a2488 100644 Binary files a/doc/papers/2011/europar/lofar.pdf and b/doc/papers/2011/europar/lofar.pdf differ diff --git a/doc/papers/2011/europar/lofar.tex b/doc/papers/2011/europar/lofar.tex index 51838f2e7f709237906724644592490160578b2c..cf73924439136756b9f02b9ad336f3944ef18495 100644 --- a/doc/papers/2011/europar/lofar.tex +++ b/doc/papers/2011/europar/lofar.tex @@ -161,9 +161,9 @@ A station focusses at a source by applying different delays to the signals from The BG/P, which receives the signals from all stations, again performs delay compensation and beam forming, this time in software. The BG/P beam former can focus on sources anywhere in the fields of view of the station beams, creating \emph{tied-array beams} (beams). An example is shown in Figure \ref{fig:pencilbeams}, in which a station beam (represented by an ellipse) contains several tied-array beams (represented by hexagons). 
The actual width of the station beam, as well as the width of the tied-array beams, depends on the number as well as the locations of the stations used. Hundreds of tied-array beams are typically needed to fully cover the field of view of a station beam. -Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depends on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers. +Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depend on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. 
Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers (complex floats). -The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit floating point number. +The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit float. Both the XY polarisations and the Stokes IQUV parameters require up to 6.2~Gb/s per beam, which severely limits the number of beams that can be produced, and which represents a time resolution which is not always necessary. For example, in sky surveys, it is desirable to create many beams. The data rate per beam thus has to be lowered. For many-beam observations, we convert the XY polarisations into just the \emph{Stokes I} parameter, which represents the amplitude of the signal in the X and Y polarisations combined. 
The resulting data rate is 1.5~Gb/s per beam, but we also allow time-wise integration to further reduce the data rate with an integer factor, allowing even more beams to be created. @@ -189,7 +189,7 @@ The communications in the all-to-all exchange are asynchronous, which allows a c \subsection{Signal Processing} -Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floating point numbers, in order to be able to do further processing using the powerful dual FPU units present in each core. The data doubles in size, which is the main reason why we implement it \emph{after} the exchange. +Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floats, in order to be able to do further processing using the powerful dual FPU units present in each core. The data doubles in size, which is the main reason why we implement it \emph{after} the exchange. Next, the data are filtered by applying a Poly-Phase Filter (PPF) bank, which consists of a Finite Impulse Response (FIR) filter and a Fast-Fourier Transform (FFT). The FFT allows the chunk, which represents a subband of 195~kHz, to be split into narrower subbands (\emph{channels}). A higher frequency resolution allows more precise corrections in the frequency domain, such as the removal of radio interference at very specific frequencies. @@ -205,7 +205,7 @@ The beamformer creates the beams as described in Section \ref{Sec:Beamforming}. The delays are applied to the station data through complex multiplications and additions, programmed in assembly. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. 
While the exact ideal set size in which the data is to be processed depends on the architecture at hand, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{Nieuwpoort:09,BAR}. -Because each beam is an accumulation of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated to reduce the resulting data rate. +Because each beam is an accumulation of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floats. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated to reduce the resulting data rate. The beamformer transforms chunks representing station data into chunks representing beam data. Because a chunk representing station data contained data for only one subband, the chunks representing different subbands of the same beam are still spread out over the full BG/P. Chunks corresponding to the same beam are brought together using a second all-to-all exchange. @@ -307,7 +307,7 @@ F & Stokes I & Y & 1 & 64 & 42 & 198 Gb/s & 65 Gb/s & CPU & Known sources \end{minipage} \end{figure} -We further analyse the workload of the compute cores by highlighting a set of cases, summarised in Table \ref{table:cases}. 
We will focus on a memory-bound case (A), which also creates the highest number of beams, on CPU-bound cases interesting for performing surveys, with either 24 stations (B) or 64 stations (C) as input. Cases D and E focus on high-resolution observations of known sources, and are I/O bound configurations with 24 and 64 stations, respectively. Case F focusses on the observations of known sources as well, using Stokes I output, which allows more beams to be created. Channel dedispersion requires the DM of the sources to be known, and is thus only enabled in the cases which observe known sources. +We further analyse the workload of the compute cores by highlighting a set of cases, summarised in Table \ref{table:cases}. We will focus on a memory-bound case (A), which also creates the highest number of beams, on CPU-bound cases interesting for performing surveys, with either 24 stations (B) or 64 stations (C) as input. Cases D and E focus on high-resolution observations of known sources, and are I/O bound configurations with 24 and 64 stations, respectively. Case F focusses on the observations of known sources as well, using Stokes I output, which allows more beams to be created. Channel-level dedispersion is applied for all cases that observe known sources. The workload of the compute cores for each case is shown in Figure \ref{fig:execution-times}, which shows the average workload per core. For the CPU-bound cases B and C, the average load has to be lower than 100\% in order to prevent fluctuations from slowing down our real-time system. These fluctuations typically occur due to clashes within the BG/P 3D-torus network which is used for both all-to-all-exchanges, and cannot be avoided in all cases. 
diff --git a/doc/papers/2011/europar/stations-beams.jgr b/doc/papers/2011/europar/stations-beams.jgr index 4338207da86dab87246788f530753a4832ef9c19..5aa178f28ee52d5b9d4ecf5d382a55bec61f6964 100644 --- a/doc/papers/2011/europar/stations-beams.jgr +++ b/doc/papers/2011/europar/stations-beams.jgr @@ -4,11 +4,26 @@ newgraph min 0 max 64 + size 2.7 yaxis label : max number of beams + size 2.5 no_auto_hash_labels + hash_label at 1 : 1 + hash_label at 3 : 9 + hash_label at 5 : 25 + hash_label at 7 : 49 + hash_label at 9 : 81 + hash_label at 11 : 121 + hash_label at 13 : 169 + hash_label at 15 : 225 + hash_label at 17 : 289 + hash_label at 19 : 361 + hash_label at 21 : 441 + hash_label at 23 : 529 +(* hash_label at 1 : 1 hash_label at 2 : 4 hash_label at 3 : 9 @@ -33,7 +48,7 @@ newgraph hash_label at 22 : 484 hash_label at 23 : 529 hash_label at 24 : 576 - +*) min 0 max 24 @@ -53,7 +68,7 @@ newline label : I/O bound legend - x 40 y 20 + x 38 y 20 linelength 5 newstring : Complex Voltages / Stokes IQUV @@ -166,7 +181,7 @@ newline newstring : Stokes I, \ 8x integration - x 2 y 18.5 + x 2 y 18.2 hjl vjc newline