Skip to content
Snippets Groups Projects
Commit 67acc7ed authored by Jan David Mol's avatar Jan David Mol
Browse files

bug 1362: paper update

parent e900dade
No related branches found
No related tags found
No related merge requests found
......@@ -21,8 +21,17 @@ awk -v N=1 '/->/ {a+=$3;b+=$5;c+=$7;d+=$9;e+=$11;f+=$13;n++; if(n==N) print "1 "
*)
newgraph
yaxis min 0 max 100 label : BG/P occupation (% CPU time)
xaxis min 0 max 7 hash 0 label : Case (see Table 1)
yaxis
min 0
max 100
size 2.5
label : BG/P occupation (% CPU time)
xaxis
min 0
max 7
hash 0
size 2.5
label : Case (see Table 1)
(*
hash_labels rotate 22 vjt hjr
hash_label at 1 : I, 16i, 4s, 543b
......@@ -89,5 +98,5 @@ newgraph
label : 1st all-to-all exchange & \
input handling
legend top defaults hjl linelength 75 x 4 y 90
legend top defaults hjl linelength 75 x 4 y 97
No preview for this file type
......@@ -161,9 +161,9 @@ A station focusses at a source by applying different delays to the signals from
The BG/P, which receives the signals from all stations, again performs delay compensation and beam forming, this time in software. The BG/P beam former can focus on sources anywhere in the fields of view of the station beams, creating \emph{tied-array beams} (beams). An example is shown in Figure \ref{fig:pencilbeams}, in which a station beam (represented by an ellipse) contains several tied-array beams (represented by hexagons). The actual width of the station beam, as well as the width of the tied-array beams, depends on the number as well as the locations of the stations used. Hundreds of tied-array beams are typically needed to fully cover the field of view of a station beam.
Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depends on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers.
Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depends on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers (complex floats).
The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit floating point number.
The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit floats.
Both the XY polarisations and the Stokes IQUV parameters require up to 6.2~Gb/s per beam, which severely limits the number of beams that can be produced, and which represents a time resolution which is not always necessary. For example, in sky surveys, it is desirable to create many beams. The data rate per beam thus has to be lowered. For many-beam observations, we convert the XY polarisations into just the \emph{Stokes I} parameter, which represents the amplitude of the signal in the X and Y polarisations combined. The resulting data rate is 1.5~Gb/s per beam, but we also allow time-wise integration to further reduce the data rate with an integer factor, allowing even more beams to be created.
......@@ -189,7 +189,7 @@ The communications in the all-to-all exchange are asynchronous, which allows a c
\subsection{Signal Processing}
Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floating point numbers, in order to be able to do further processing using the powerful dual FPU units present in each core. The data doubles in size, which is the main reason why we implement it \emph{after} the exchange.
Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floats, in order to be able to do further processing using the powerful dual FPU units present in each core. The data doubles in size, which is the main reason why we implement it \emph{after} the exchange.
Next, the data are filtered by applying a Poly-Phase Filter (PPF) bank, which consists of a Finite Impulse Response (FIR) filter and a Fast-Fourier Transform (FFT). The FFT allows the chunk, which represents a subband of 195~kHz, to be split into narrower subbands (\emph{channels}). A higher frequency resolution allows more precise corrections in the frequency domain, such as the removal of radio interference at very specific frequencies.
......@@ -205,7 +205,7 @@ The beamformer creates the beams as described in Section \ref{Sec:Beamforming}.
The delays are applied to the station data through complex multiplications and additions, programmed in assembly. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. While the exact ideal set size in which the data is to be processed depends on the architecture at hand, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{Nieuwpoort:09,BAR}.
Because each beam is an accumulation of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated to reduce the resulting data rate.
Because each beam is an accumulation of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floats. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated to reduce the resulting data rate.
The beamformer transforms chunks representing station data into chunks representing beam data. Because a chunk representing station data contained data for only one subband, the chunks representing different subbands of the same beam are still spread out over the full BG/P. Chunks corresponding to the same beam are brought together using a second all-to-all exchange.
......@@ -307,7 +307,7 @@ F & Stokes I & Y & 1 & 64 & 42 & 198 Gb/s & 65 Gb/s & CPU & Known sources
\end{minipage}
\end{figure}
We further analyse the workload of the compute cores by highlighting a set of cases, summarised in Table \ref{table:cases}. We will focus on a memory-bound case (A), which also creates the highest number of beams, on CPU-bound cases interesting for performing surveys, with either 24 stations (B) or 64 stations (C) as input. Cases D and E focus on high-resolution observations of known sources, and are I/O bound configurations with 24 and 64 stations, respectively. Case F focusses on the observations of known sources as well, using Stokes I output, which allows more beams to be created. Channel dedispersion requires the DM of the sources to be known, and is thus only enabled in the cases which observe known sources.
We further analyse the workload of the compute cores by highlighting a set of cases, summarised in Table \ref{table:cases}. We will focus on a memory-bound case (A), which also creates the highest number of beams, on CPU-bound cases interesting for performing surveys, with either 24 stations (B) or 64 stations (C) as input. Cases D and E focus on high-resolution observations of known sources, and are I/O bound configurations with 24 and 64 stations, respectively. Case F focusses on the observations of known sources as well, using Stokes I output, which allows more beams to be created. Channel-level dedispersion is applied for all cases that observe known sources.
The workload of the compute cores for each case is shown in Figure \ref{fig:execution-times}, which shows the average workload per core. For the CPU-bound cases B and C, the average load has to be lower than 100\% in order to prevent fluctuations from slowing down our real-time system. These fluctuations typically occur due to clashes within the BG/P 3D-torus network which is used for both all-to-all-exchanges, and cannot be avoided in all cases.
......
......@@ -4,11 +4,26 @@ newgraph
min 0
max 64
size 2.7
yaxis
label : max number of beams
size 2.5
no_auto_hash_labels
hash_label at 1 : 1
hash_label at 3 : 9
hash_label at 5 : 25
hash_label at 7 : 49
hash_label at 9 : 81
hash_label at 11 : 121
hash_label at 13 : 169
hash_label at 15 : 225
hash_label at 17 : 289
hash_label at 19 : 361
hash_label at 21 : 441
hash_label at 23 : 529
(*
hash_label at 1 : 1
hash_label at 2 : 4
hash_label at 3 : 9
......@@ -33,7 +48,7 @@ newgraph
hash_label at 22 : 484
hash_label at 23 : 529
hash_label at 24 : 576
*)
min 0
max 24
......@@ -53,7 +68,7 @@ newline
label : I/O bound
legend
x 40 y 20
x 38 y 20
linelength 5
newstring : Complex Voltages / Stokes IQUV
......@@ -166,7 +181,7 @@ newline
newstring : Stokes I, \
8x integration
x 2 y 18.5
x 2 y 18.2
hjl vjc
newline
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment