Commit 99c237ac authored by Jan David Mol

bug 1362: paper update

parent f23c0dc3
...@@ -28,8 +28,11 @@ GEN_FILES = $(AUX_FILES) $(GEN_FIGURES) lofar.pdf $(GEN_EXT:%=lofar.%)\ ...@@ -28,8 +28,11 @@ GEN_FILES = $(AUX_FILES) $(GEN_FIGURES) lofar.pdf $(GEN_EXT:%=lofar.%)\
TEXINPUTS = inputs:.: TEXINPUTS = inputs:.:
TEXFONTS = : TEXFONTS = :
%.pdf: %.jgr %.eps: %.jgr
jgraph $< | epstopdf --filter > $@ jgraph $< > $@
%.pdf: %.eps
epstopdf --filter < $< > $@
%.pdf: %.fig %.pdf: %.fig
fig2dev -L pdf $< $@ fig2dev -L pdf $< $@
......
...@@ -28,7 +28,7 @@ newgraph ...@@ -28,7 +28,7 @@ newgraph
x 0.5 y 1.2 x 0.5 y 1.2
newline newline
label : Without dedispersion label : No channel-level dedispersion
linetype dashed linetype dashed
linethickness 2.0 linethickness 2.0
color 0 0 1 color 0 0 1
...@@ -42,10 +42,10 @@ newline ...@@ -42,10 +42,10 @@ newline
0.04283439 0.18166798 0.20039363 0.43340311 0.61501703 0.80407534\ 0.04283439 0.18166798 0.20039363 0.43340311 0.61501703 0.80407534\
0.83333333 0.75695429 0.69960432 0.66769064 0.65099889 0.45159893\ 0.83333333 0.75695429 0.69960432 0.66769064 0.65099889 0.45159893\
0.43001391 0.30665736 0.12644565 0.01286941 0.084529 0.02011669;\ 0.43001391 0.30665736 0.12644565 0.01286941 0.084529 0.02011669;\
do echo $N / 23 | bc -l; echo "($i + 0.13795437)/(0.83333333 + 0.13795437)" | bc -l; N=$[$N+1]; done do echo $N / 23 | bc -l; echo "($i + 0.13795437)/(0.83333333 + 0.13795437)" | bc -l; N=$(($N+1)); done
newline newline
label : With dedispersion label : Channel-level dedispersion
linetype solid linetype solid
linethickness 2.0 linethickness 2.0
color 1 0 0 color 1 0 0
...@@ -60,5 +60,5 @@ newline ...@@ -60,5 +60,5 @@ newline
-0.05202656 -0.04786229 0.01148896 0.39690428 0.71437241 0.83333333\ -0.05202656 -0.04786229 0.01148896 0.39690428 0.71437241 0.83333333\
0.62925924 0.49068521 0.41371978 0.49633274 0.53869321 0.46915986\ 0.62925924 0.49068521 0.41371978 0.49633274 0.53869321 0.46915986\
0.23657243 0.14340703 0.06047267 0.01298272 -0.06736764 0.01427185;\ 0.23657243 0.14340703 0.06047267 0.01298272 -0.06736764 0.01427185;\
do echo $N / 23 | bc -l; echo "($i + 0.07190334)/(0.83333333 + 0.07190334)" | bc -l; N=$[$N+1]; done do echo $N / 23 | bc -l; echo "($i + 0.07190334)/(0.83333333 + 0.07190334)" | bc -l; N=$(($N+1)); done
...@@ -154,7 +154,7 @@ We customised the I/O node software stack~\cite{Yoshii:10} and run a multi-threa ...@@ -154,7 +154,7 @@ We customised the I/O node software stack~\cite{Yoshii:10} and run a multi-threa
- external networks and storage nodes - external networks and storage nodes
} }
\section{Beamforming} \section{Beam Forming}
\label{Sec:Beamforming} \label{Sec:Beamforming}
A station focusses at a source by applying different delays to the signals from its antennas, and subsequently adding the delayed signals. Delaying the signals is known as \emph{delay compensation}, and subsequently adding them as \emph{beam forming}. The station beam former is implemented in hardware, in which the signal is delayed by switching it over wires of different lengths. The signals from the different antennas are subsequently added using an FPGA. The resulting \emph{station beam}, even though it is focussed on a certain source, still has a wide field of view. A station focusses at a source by applying different delays to the signals from its antennas, and subsequently adding the delayed signals. Delaying the signals is known as \emph{delay compensation}, and subsequently adding them as \emph{beam forming}. The station beam former is implemented in hardware, in which the signal is delayed by switching it over wires of different lengths. The signals from the different antennas are subsequently added using an FPGA. The resulting \emph{station beam}, even though it is focussed on a certain source, still has a wide field of view.
...@@ -163,11 +163,11 @@ The BG/P, which receives the signals from all stations, again performs delay com ...@@ -163,11 +163,11 @@ The BG/P, which receives the signals from all stations, again performs delay com
Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depends on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers. Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depends on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. 
To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers.
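The weighted sum described in this paragraph can be sketched as follows. This is a minimal illustration, not the assembly kernel the paper refers to: the function names `phase_weight` and `form_beam` are hypothetical, the sign of the phase is an assumed convention, and the sketch handles a single polarisation at a single frequency.

```python
import cmath

def phase_weight(delay_s, freq_hz):
    """Unit-magnitude complex weight for a residual (sub-sample) delay.

    The phase shift is 2*pi*f*tau; the negative sign is an assumed
    convention, the paper does not state which sign it uses.
    """
    return cmath.exp(-2j * cmath.pi * freq_hz * delay_s)

def form_beam(station_samples, weights):
    """Multiply each station stream by its weight and add the streams.

    station_samples: one list of complex samples per station,
                     already aligned to whole samples.
    weights:         one complex weight per station.
    """
    n_samples = len(station_samples[0])
    return [
        sum(w * s[t] for w, s in zip(weights, station_samples))
        for t in range(n_samples)
    ]
```

A different tied-array beam reuses the same aligned station streams with a different list of weights, which is why only the sub-sample part of the delay changes per beam.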
The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit real floating point number. The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit floating point number.
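The Stokes definitions above translate directly into code. The sketch below is illustrative only; `stokes_iquv` is a hypothetical name, and it converts a single pair of complex X and Y samples rather than a whole stream.

```python
def stokes_iquv(x, y):
    """Convert one complex (X, Y) sample pair into real Stokes I, Q, U, V."""
    xx = (x * x.conjugate()).real   # X * conj(X), always real
    yy = (y * y.conjugate()).real   # Y * conj(Y), always real
    xy = x * y.conjugate()          # X * conj(Y), complex in general
    return xx + yy, xx - yy, 2 * xy.real, 2 * xy.imag
```

For a single sample pair the result always satisfies I^2 = Q^2 + U^2 + V^2, which is a convenient sanity check on an implementation.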
Both the XY polarisations and the Stokes IQUV parameters require up to 6.2~Gb/s per beam, which severely limits the number of beams that can be produced, and which represents a time resolution which is not always necessary. For example, in sky surveys, it is desirable to create many beams. The data rate per beam thus has to be lowered. For many-beam observations, we convert the XY polarisations into just the \emph{Stokes I} parameter, which represents the amplitude of the signal in the X and Y polarisations combined. The resulting data rate is 1.5~Gb/s per beam, but we also allow time-wise integration to further reduce the data rate with an integer factor, allowing even more beams to be created. Both the XY polarisations and the Stokes IQUV parameters require up to 6.2~Gb/s per beam, which severely limits the number of beams that can be produced, and which represents a time resolution which is not always necessary. For example, in sky surveys, it is desirable to create many beams. The data rate per beam thus has to be lowered. For many-beam observations, we convert the XY polarisations into just the \emph{Stokes I} parameter, which represents the amplitude of the signal in the X and Y polarisations combined. The resulting data rate is 1.5~Gb/s per beam, but we also allow time-wise integration to further reduce the data rate with an integer factor, allowing even more beams to be created.
For each beam, each polarisation or Stokes parameter is stored in a separate file. If too many I/O nodes are limited to 1.1~Gb/s of output, full polarisation or Stokes parameter streams are too wide to transport. In such cases, we split the streams into several substreams, each holding 83 or 124 subbands per stream instead of 248. For each beam, each polarisation or Stokes parameter is stored in a separate file. If too many I/O nodes are limited to 1.1~Gb/s of output, full polarisation or Stokes parameter streams are too wide to transport. In such cases, we split the streams into \emph{slices} of 83 or 124 subbands per substream instead of 248.
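The data rates quoted in these paragraphs follow from simple proportions. The sketch below assumes that a full beam carries four 32-bit values per sample (two complex polarisations, or four Stokes parameters) and that the rate scales linearly with the number of subbands. The helper name `stream_rate_gbps` is hypothetical, and pairing 83-subband slices with polarisations and 124-subband slices with Stokes parameters is our reading of the text, not something it states.

```python
FULL_RATE_GBPS = 6.2   # XY or Stokes IQUV per beam, from the text
SUBBANDS = 248         # subbands in an unsplit stream, from the text

def stream_rate_gbps(subbands, values_kept, integration_factor=1):
    """Rate of a stream that keeps `values_kept` of the four 32-bit
    values per sample, for `subbands` of the 248 subbands."""
    share = (subbands / SUBBANDS) * (values_kept / 4)
    return FULL_RATE_GBPS * share / integration_factor

stokes_i = stream_rate_gbps(248, 1)   # about 1.55, quoted as 1.5 Gb/s
one_pol = stream_rate_gbps(248, 2)    # about 3.1 Gb/s, above the 1.1 Gb/s limit
pol_slice = stream_rate_gbps(83, 2)   # about 1.04 Gb/s, fits within 1.1 Gb/s
```

Under these assumptions, a single polarisation needs three slices of 82 or 83 subbands to stay below 1.1 Gb/s, while a single Stokes parameter fits after splitting into two slices of 124 subbands.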
% TODO: incoherent stokes % TODO: incoherent stokes
\section{Pulsar Pipeline} \section{Pulsar Pipeline}
...@@ -187,7 +187,7 @@ The first all-to-all exchange in the pipeline allows the compute cores to distri ...@@ -187,7 +187,7 @@ The first all-to-all exchange in the pipeline allows the compute cores to distri
The communications in the all-to-all exchange are asynchronous, which allows a compute core to start processing a subband from a station as soon as it arrives, up to the point that data from all stations are required. Communication and computation are thus overlapped as much as possible. The communications in the all-to-all exchange are asynchronous, which allows a compute core to start processing a subband from a station as soon as it arrives, up to the point that data from all stations are required. Communication and computation are thus overlapped as much as possible.
\subsection{Pre-beamforming Signal Processing} \subsection{Signal Processing}
Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floating point numbers, in order to be able to do further processing using the powerful dual FPU units present in each core. The data doubles in size, which is the main reason why we implement it \emph{after} the exchange. Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floating point numbers, in order to be able to do further processing using the powerful dual FPU units present in each core. The data doubles in size, which is the main reason why we implement it \emph{after} the exchange.
...@@ -199,11 +199,11 @@ Next, a band pass correction is applied to adjust the signal strengths in all ch ...@@ -199,11 +199,11 @@ Next, a band pass correction is applied to adjust the signal strengths in all ch
Up to this point, processing chunks from different stations can be done independently, but from here on, the data from all stations are required. The first all-to-all exchange thus ends here. Up to this point, processing chunks from different stations can be done independently, but from here on, the data from all stations are required. The first all-to-all exchange thus ends here.
\subsection{Beamforming} \subsection{Beam Forming}
The beamformer creates the beams as described in Section \ref{Sec:Beamforming}. First, the different weights required for the different beams are computed, based on the station positions and the beam directions. Note that the data in the chunks are already delay compensated with respect to the source at which the stations are pointed. Any delay compensation performed by the beamformer is therefore to compensate the delay differences between the desired beams and the station's source. The reason for this two-stage approach is flexibility. By already compensating for the station's source in the previous step, the resulting data can not only be fed to the beamformer, but also to other pipelines, such as the imaging pipeline. Because we have a software pipeline, we can implement and connect different processing pipelines with only a small increase in complexity. The beamformer creates the beams as described in Section \ref{Sec:Beamforming}. First, the different weights required for the different beams are computed, based on the station positions and the beam directions. Note that the data in the chunks are already delay compensated with respect to the source at which the stations are pointed. Any delay compensation performed by the beamformer is therefore to compensate the delay differences between the desired beams and the station's source. The reason for this two-stage approach is flexibility. By already compensating for the station's source in the previous step, the resulting data can not only be fed to the beamformer, but also to other pipelines, such as the imaging pipeline. Because we have a software pipeline, we can implement and connect different processing pipelines with only a small increase in complexity.
The delays are applied to the station data through complex multiplications and additions, programmed in assembly. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. While the exact ideal set size in which the data is to be processed depends on the architecture at hand, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{FOO,BAR}. The delays are applied to the station data through complex multiplications and additions, programmed in assembly. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. While the exact ideal set size in which the data is to be processed depends on the architecture at hand, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{Nieuwpoort:09,BAR}.
Because each beam is an accumulated combination of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated together to reduce the resulting data rate. Because each beam is an accumulated combination of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated together to reduce the resulting data rate.
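Time-wise integration as described here amounts to summing fixed-size groups of consecutive samples. A minimal sketch follows, with the hypothetical name `integrate`; dropping an incomplete trailing group is an assumption, since the text does not say how leftover samples are handled.

```python
def integrate(samples, factor):
    """Accumulate groups of `factor` consecutive samples into one value.

    The output is `factor` times shorter than the input. A trailing
    incomplete group is discarded (an assumption of this sketch).
    """
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) for i in range(0, usable, factor)]
```

With an integration factor of 2, a Stokes I stream would drop from roughly 1.5 Gb/s to roughly 0.75 Gb/s, at the cost of halving the time resolution.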
...@@ -211,27 +211,25 @@ The beamformer transforms chunks representing station data into chunks represent ...@@ -211,27 +211,25 @@ The beamformer transforms chunks representing station data into chunks represent
\subsection{Channel-level Dedispersion} \subsection{Channel-level Dedispersion}
Another major component in the pulsar-observation pipeline is real-time dedispersion. Since light of a high frequency travels faster through the interstellar medium than light of a lower frequency, the arrival time of a pulse differs for different wave lengths. To combine data from multiple frequency channels, the channels must be aligned (shifted in time). Otherwise, the pulse will be smeared, causing any details to be lost. This process, called \emph{dedispersion}, is done by post-processing software that runs after the observation has finished. However, to observe at the lowest frequencies, or to observe fast-rotating millisecond pulsars, dedispersion must also be performed \emph{within\/} a channel, since our channels (typically 12~KHz) are too wide to ignore dispersion. Another major component in the pulsar-observation pipeline is real-time dedispersion. Since light of a high frequency travels faster through the interstellar medium than light of a lower frequency, the arrival time of a pulse differs for different wave lengths. To combine data from multiple frequency channels, the channels must be aligned (shifted in time). Otherwise, the pulse will be smeared or even overlap with the next pulse, causing many details to be lost. This process, called \emph{dedispersion}, is done by post-processing software that runs after the observation has finished. However, to observe at the lowest frequencies, or to observe fast-rotating millisecond pulsars, dedispersion must also be performed \emph{within\/} a channel, since our channels (typically 12~KHz) are too wide to ignore dispersion.
Dedispersion is performed in the frequency domain, effectively by doing a 4K~Fourier transform that splits a 12~KHz channel into 3~Hz subchannels. The phases of the observed samples are corrected by applying a Chirp function~\cite{...}, i.e., by multiplication with precomputed, subchannel-dependent, complex weights. These multiplications are programmed in assembly, to reduce the computational costs. A backward FFT is done to revert to 12~KHz channels. Dedispersion is performed in the frequency domain, effectively by doing a 4K~Fourier transform (FFT) that splits a 12~KHz channel into 3~Hz subchannels. The phases of the observed samples are corrected by applying a Chirp function~\cite{...}, i.e., by multiplication with precomputed, subchannel-dependent, complex weights. These multiplications are programmed in assembly, to reduce the computational costs. A backward FFT is done to revert to 12~KHz channels.
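The forward transform, weight multiplication and backward transform can be sketched as below. This is a structural illustration only. It uses a naive O(n^2) DFT from the standard library instead of an optimised 4K FFT, the names `dft` and `dedisperse_channel` are hypothetical, and the weights are taken as given: the actual chirp weights depend on the dispersion measure and the observing frequency and are not derived here.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform; stands in for the FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]
    return [v / n for v in out] if inverse else out

def dedisperse_channel(samples, weights):
    """Correct the phases of one channel in the frequency domain.

    samples: complex time samples of one channel.
    weights: one precomputed complex weight per subchannel.
    """
    spectrum = dft(samples)                              # channel -> subchannels
    corrected = [s * w for s, w in zip(spectrum, weights)]
    return dft(corrected, inverse=True)                  # back to one channel
```

As a check on the mechanism, weights with a phase that grows linearly with the subchannel index shift the signal in time by a fixed number of samples; a chirp generalises this to a frequency-dependent shift. The quoted numbers are also consistent: 12 kHz divided over 4096 subchannels is about 2.9 Hz.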
\begin{figure}[ht] \begin{figure}[ht]
\includegraphics[width=0.5\textwidth]{coherent-dedispersion.pdf} \includegraphics[width=0.5\textwidth]{coherent-dedispersion.pdf}
\label{fig:dedispersion-result} \label{fig:dedispersion-result}
\caption{Pulse profiles of pulsar J0034-0534, with and without dedispersion.} \caption{Pulse profiles of pulsar J0034-0534, with and without dedispersion applied at channel level.}
\end{figure} \end{figure}
Figure~\ref{fig:dedispersion-result} shows the effectiveness of channel-level dedispersion, where we observed pulsar J0034-0534 with a pulse period of 1.88~ms with and without dedispersion. By applying dedispersion, the effective time resolution is improved from 0.51~ms to 0.082~ms, revealing a narrower, more detailed pulse above a lower noise floor. Dedispersion thus contributes significantly to the data quality of the LOFAR telescope, but it also comes at a significant computational cost due to the two FFTs it requires. It also demonstrates the power of using a \emph{software\/} telescope; the pipeline component was implemented, verified, and optimized in only one month time. Figure~\ref{fig:dedispersion-result} shows the effectiveness of channel-level dedispersion, where we observed pulsar J0034-0534 with a pulse period of 1.88~ms. By applying dedispersion, the effective time resolution is improved from 0.51~ms to 0.082~ms, revealing a narrower, more detailed pulse and a better signal-to-noise ratio. Dedispersion thus contributes significantly to the data quality, but it also comes at a significant computational cost due to the two FFTs it requires. It demonstrates the power of using a \emph{software\/} telescope: the pipeline component was implemented, verified, and optimized in only one month time.
\subsection{Second All-to-all Exchange} \subsection{Second All-to-all Exchange}
In the second all-to-all exchange, the chunks made by the beamformer are rearranged again using the 3D-torus network. Due to memory constraints on the compute cores, the cores that performed the beamforming cannot be the same cores that receive the beam data after the exchange. Instead, we assign a set of cores (\emph{output cores}) to receive the rearranged chunks of beam data. The output cores are chosen before an observation, and are never scheduled to participate in the earlier computations in the pipeline (which are done by the \emph{input cores}). In the second all-to-all exchange, the chunks made by the beamformer are again exchanged over the 3D-torus network. Due to memory constraints on the compute cores, the cores that performed the beam forming cannot be the same cores that receive the beam data after the exchange. We assign a set of cores (\emph{output cores}) to receive the chunks. The output cores are chosen before an observation, and are distinct from the \emph{input cores} which perform the earlier computations in the pipeline.
An output core first gathers chunks of beam data that belong to the same beam but represent different subbands. Then, it puts the data in the final ordering, which means both sorting the received chunks and reordering the dimensions of the data within a chunk. Reordering the data within a chunk is necessary, because the data order that will be written to disk is not the same order that can be produced by our computations without taking heavy L1 cache penalties. We hide this reordering cost at the output cores by overlapping computation (the reordering of a chunk) with communication (the arrival of other chunks). Once all of the chunks are received and reordered, they are sent back to the I/O node using the tree network. An output core gathers the chunks that contain different subbands but belong to the same slice (see Section \ref{Sec:Beamforming}). Then, it rearranges the dimensions of the data into their final ordering, which is necessary, because the data order that will be written to disk is not the same order that can be produced by our computations without taking heavy L1 cache penalties. We hide this reordering cost at the output cores by overlapping computation (the reordering of a chunk) with communication (the arrival of other chunks). Once all of the chunks are received and reordered, they are sent back to the I/O node.
For the distribution of the workload over the available output cores, three factors have to be considered. First, all of the data belonging to the same beam has to be processed by output cores in the same pset, in order to ensure that its I/O node can concatenate all of the 0.25 second chunks that belong to the beam. Second, the maximal output rate of the I/O node has to be considered. As mentioned in Section \ref{Sec:Networks}, an I/O node can output 1.1~Gb/s if it is processing station input data, and 3.1~Gb/s if it is not. However, a complex-voltages beam is 6.2~Gb/s. We therefor optionally split each beam into two or more substreams, each of which will be output by a separate I/O node and each of which will eventually end up as a seperate file on disk. For example, we store the X and Y polarisation data of a complex-voltages beam in two separate files. If there are not enough I/O nodes which do not process station input data, the X and Y polarisations have to be split further, in which case we store 82 or 83 subbands per file, with 3 files per polarisation. For the distribution of the workload over the available output cores, three factors have to be considered. First, all of the data belonging to the same beam has to be processed by output cores in the same pset, to ensure that one I/O node can concatenate all of the 0.25 second chunks that belong to the beam. Second, the maximum output rate per I/O node has to be respected. Finally, the presence of the first all-to-all exchange, which uses the same network at up to 198~Gb/s. The second exchange uses up to 80~Gb/s. Even though each link sustains 3.4~Gb/s, it has to process the traffic from four cores, as well as traffic routed through it between other nodes. The network links in the BG/P become overloaded unless enough output cores are used to spread the load.
A third factor in the scheduling of the output cores is the presence of the first all-to-all-exchange, which uses the same network at up to 198~Gb/s. The second exchange uses up to 80~Gb/s. Careful planning of the workload distribution is necessary, because the 3D-torus network routes data over other compute nodes, with each link sustaining 3.4~Gb/s. Some network links in the BG/P become overloaded unless enough output cores are used to spread the load.
\subsection{Transport to Disks} \subsection{Transport to Disks}
Once an output core has received and reordered all of its data, the data are sent to the core's I/O node. The I/O node forwards the data over TCP/IP to the storage cluster. To avoid any stalling in our pipeline due to network congestion or disk issues, the I/O node uses a best-effort buffer which drops data if it cannot be sent. Once an output core has received and reordered all of its data, the data are sent to the core's I/O node. The I/O node forwards the data over TCP/IP to the storage cluster. To avoid any stalling in our pipeline due to network congestion or disk issues, the I/O node uses a best-effort buffer which drops data if it cannot be sent.
...@@ -283,7 +281,21 @@ F & Stokes I & Y & 1 & 64 & 42 & 198 Gb/s & 65 Gb/s & CPU & Known sources ...@@ -283,7 +281,21 @@ F & Stokes I & Y & 1 & 64 & 42 & 198 Gb/s & 65 Gb/s & CPU & Known sources
\label{table:cases} \label{table:cases}
\end{table} \end{table}
We further analyse the workload of the compute cores by highlighting a set of interesting cases, summarised in Table \ref{table:cases}. We will focus on a memory-bound case (A), which also creates the highest number of beams, on CPU-bound cases interesting for performing surveys, with either 24 stations (B) or 64 stations (C) as input. Cases D and E focus on high-resolution observations of known sources, and are I/O bound configurations with 24 and 64 stations, respectively. Case F focusses on the observations of known sources as well, using Stokes I output, which allows more beams to be created. Channel dedispersion requires the DM of the sources to be known, and is thus only enabled in the cases which observe known sources. \begin{figure}[ht]
\begin{minipage}[t]{0.47\textwidth}
\includegraphics[width=\textwidth]{stations-beams.pdf}
\label{fig:stations-beams}
\caption{The maximum number of beams that can be created in various configurations.}
\end{minipage}
\hfill
\begin{minipage}[t]{0.47\textwidth}
\includegraphics[width=\textwidth]{execution_times.pdf}
\label{fig:execution-times}
\caption{The time spent in the processing steps.}
\end{minipage}
\end{figure}
We further analyse the workload of the compute cores by highlighting a set of cases, summarised in Table \ref{table:cases}. We will focus on a memory-bound case (A), which also creates the highest number of beams, on CPU-bound cases interesting for performing surveys, with either 24 stations (B) or 64 stations (C) as input. Cases D and E focus on high-resolution observations of known sources, and are I/O bound configurations with 24 and 64 stations, respectively. Case F focusses on the observations of known sources as well, using Stokes I output, which allows more beams to be created. Channel dedispersion requires the DM of the sources to be known, and is thus only enabled in the cases which observe known sources.
The workload of the compute cores for each case is shown in Figure \ref{fig:execution-times}, which shows the average workload per core. For the CPU-bound cases B and C, the average load has to be lower than 100\% in order to prevent fluctuations from slowing down our real-time system. These fluctuations typically occur due to clashes within the BG/P 3D-torus network which is used for both all-to-all-exchanges, and cannot be avoided in all cases. The workload of the compute cores for each case is shown in Figure \ref{fig:execution-times}, which shows the average workload per core. For the CPU-bound cases B and C, the average load has to be lower than 100\% in order to prevent fluctuations from slowing down our real-time system. These fluctuations typically occur due to clashes within the BG/P 3D-torus network which is used for both all-to-all-exchanges, and cannot be avoided in all cases.
...@@ -291,7 +303,7 @@ The cases which create many beams (A-C) spend most of the cycles performing beam ...@@ -291,7 +303,7 @@ The cases which create many beams (A-C) spend most of the cycles performing beam
The costs for both the first and the second all-to-all exchange are mostly hidden due to overlaps with computation. The remaining cost for the second exchange is proportional to the output bandwidth required in each case. The costs for both the first and the second all-to-all exchange are mostly hidden due to overlaps with computation. The remaining cost for the second exchange is proportional to the output bandwidth required in each case.
For the I/O-bound cases D-F, only a few tied-array beams are formed and transformed into Stokes I(QUV) parameters, which produces a lot of data but requires little CPU time. Enough CPU time is therefore available to include channel dedispersion, which scales with the number of beams and, as Figure \ref{fig:execution-times} shows, is an expensive operation. For the I/O-bound cases D-F, only a few tied-array beams are formed and transformed into Stokes I(QUV) parameters, which produces a lot of data but requires little CPU time. Enough CPU time is therefore available to include channel-level dedispersion, which scales with the number of beams and, as Figure \ref{fig:execution-times} shows, is an expensive operation.
\comment{ \comment{
- hit CPU, memory and I/O bounds - hit CPU, memory and I/O bounds
...@@ -300,20 +312,6 @@ For the I/O-bound cases D-F, only a few tied-array beams are formed and transfor ...@@ -300,20 +312,6 @@ For the I/O-bound cases D-F, only a few tied-array beams are formed and transfor
} }
\begin{figure}[ht]
\begin{minipage}[t]{0.47\textwidth}
\includegraphics[width=\textwidth]{stations-beams.pdf}
\label{fig:stations-beams}
\caption{The maximum number of beams that can be created in various configurations.}
\end{minipage}
\hfill
\begin{minipage}[t]{0.47\textwidth}
\includegraphics[width=\textwidth]{execution_times.pdf}
\label{fig:execution-times}
\caption{The time spent in the processing steps.}
\end{minipage}
\end{figure}
\comment{ \comment{
Intro belooft: Intro belooft:
- performance - performance
...@@ -328,8 +326,6 @@ For the I/O-bound cases D-F, only a few tied-array beams are formed and transfor ...@@ -328,8 +326,6 @@ For the I/O-bound cases D-F, only a few tied-array beams are formed and transfor
- network bw use in 2nd transpose - network bw use in 2nd transpose
} }
\cite{Hessels:09}
\section{Discussion} \section{Discussion}
\comment{ \comment{
......
...@@ -13,12 +13,12 @@ ...@@ -13,12 +13,12 @@
xmlns="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
width="12.6in" width="1295.1581"
height="13.1in" height="1074.5101"
viewBox="17009 5954 15113 15680" viewBox="17009 5954 17260.779 14290.346"
id="svg2" id="svg2"
version="1.1" version="1.1"
inkscape:version="0.47pre4 r22446" inkscape:version="0.48.0 r9654"
sodipodi:docname="pencilbeams.svg"> sodipodi:docname="pencilbeams.svg">
<metadata <metadata
id="metadata54"> id="metadata54">
...@@ -28,7 +28,7 @@ ...@@ -28,7 +28,7 @@
<dc:format>image/svg+xml</dc:format> <dc:format>image/svg+xml</dc:format>
<dc:type <dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title> <dc:title />
</cc:Work> </cc:Work>
</rdf:RDF> </rdf:RDF>
</metadata> </metadata>
...@@ -51,21 +51,37 @@ ...@@ -51,21 +51,37 @@
guidetolerance="10" guidetolerance="10"
inkscape:pageopacity="0" inkscape:pageopacity="0"
inkscape:pageshadow="2" inkscape:pageshadow="2"
inkscape:window-width="1048" inkscape:window-width="1022"
inkscape:window-height="829" inkscape:window-height="829"
id="namedview50" id="namedview50"
showgrid="false" showgrid="false"
inkscape:zoom="0.80067854" inkscape:zoom="0.35355339"
inkscape:cx="972.28554" inkscape:cx="746.49403"
inkscape:cy="889.24576" inkscape:cy="466.33141"
inkscape:window-x="0" inkscape:window-x="0"
inkscape:window-y="830" inkscape:window-y="397"
inkscape:window-maximized="0" inkscape:window-maximized="0"
inkscape:current-layer="svg2" /> inkscape:current-layer="svg2"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0" />
<polygon <polygon
id="polygon8" id="polygon8"
style="fill:#e3e3e3;stroke:#656565;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter;fill-opacity:1;stroke-opacity:1" style="fill:#e3e3e3;fill-opacity:1;stroke:#656565;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
points="31477,12361 25778,6452 25778,6452 17385,21259 " /> points="31477,12361 25778,6452 25778,6452 17385,21259 "
transform="matrix(0.97528983,0.22092926,-0.22092926,0.97528983,5607.8521,-5842.3937)" />
<path
style="color:#000000;fill:#e3e3e3;fill-opacity:1;fill-rule:nonzero;stroke:#656565;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker:none;visibility:visible;display:inline;overflow:visible;enable-background:accumulate"
d="M 29131.927,6325.4432 C 20810.207,19254.117 20772.512,18726.388 20772.512,18726.388 l 13054.904,-5705.329 z"
id="path3025"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cccc" />
<polygon
points="31477,12361 25778,6452 25778,6452 17385,21259 "
style="fill:none;stroke:#656565;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
id="polygon3192"
transform="matrix(0.97528983,0.22092926,-0.22092926,0.97528983,5607.8521,-5842.3937)" />
<ellipse <ellipse
sodipodi:ry="1725" sodipodi:ry="1725"
sodipodi:rx="4125" sodipodi:rx="4125"
...@@ -78,100 +94,120 @@ ...@@ -78,100 +94,120 @@
style="fill:#ffffe0;stroke:#656565;stroke-width:95;stroke-opacity:1" style="fill:#ffffe0;stroke:#656565;stroke-width:95;stroke-opacity:1"
ry="1725" ry="1725"
rx="4125" rx="4125"
transform="matrix(0.70711587,0.70709769,-0.70709769,0.70711587,28741,9310)" /> transform="matrix(0.53342435,0.84584777,-0.84584777,0.53342435,31581.805,9587.2832)" />
<polygon <polygon
id="polygon12" id="polygon12"
style="fill:#ffe0e0;stroke:#ffff00;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ffff00;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="27590,9125 26920,8455 26828,7880 27403,7974 27403,7974 28073,8640 28165,9217 " points="26828,7880 27403,7974 27403,7974 28073,8640 28165,9217 27590,9125 26920,8455 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon14" id="polygon14"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="27315,7397 26643,6729 26553,6154 27128,6246 27128,6246 27796,6914 27886,7491 " points="26553,6154 27128,6246 27128,6246 27796,6914 27886,7491 27315,7397 26643,6729 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon16" id="polygon16"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="26828,7880 26162,7212 26066,6637 26643,6729 26643,6729 27315,7397 27403,7974 " points="26066,6637 26643,6729 26643,6729 27315,7397 27403,7974 26828,7880 26162,7212 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon18" id="polygon18"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="26345,8365 25677,7699 25587,7122 26162,7212 26162,7212 26828,7880 26920,8455 " points="25587,7122 26162,7212 26162,7212 26828,7880 26920,8455 26345,8365 25677,7699 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon20" id="polygon20"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="27107,9610 26439,8940 26345,8365 26920,8455 26920,8455 27590,9125 27680,9700 " points="26345,8365 26920,8455 26920,8455 27590,9125 27680,9700 27107,9610 26439,8940 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon22" id="polygon22"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="28073,8640 27403,7974 27315,7397 27886,7491 27886,7491 28556,8157 28653,8732 " points="27315,7397 27886,7491 27886,7491 28556,8157 28653,8732 28073,8640 27403,7974 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon24" id="polygon24"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="28556,8157 27886,7491 27796,6914 28371,7006 28371,7006 29041,7674 29131,8249 " points="27796,6914 28371,7006 28371,7006 29041,7674 29131,8249 28556,8157 27886,7491 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon26" id="polygon26"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="29803,8919 29131,8249 29041,7674 29614,7766 29614,7766 30282,8436 30376,9009 " points="29041,7674 29614,7766 29614,7766 30282,8436 30376,9009 29803,8919 29131,8249 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon28" id="polygon28"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="29318,9400 28653,8732 28556,8157 29131,8249 29131,8249 29803,8919 29891,9492 " points="28556,8157 29131,8249 29131,8249 29803,8919 29891,9492 29318,9400 28653,8732 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon30" id="polygon30"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="28833,9883 28165,9217 28073,8640 28653,8732 28653,8732 29318,9400 29408,9980 " points="28073,8640 28653,8732 28653,8732 29318,9400 29408,9980 28833,9883 28165,9217 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon32" id="polygon32"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="28348,10368 27680,9700 27590,9125 28165,9217 28165,9217 28833,9883 28923,10460 " points="27590,9125 28165,9217 28165,9217 28833,9883 28923,10460 28348,10368 27680,9700 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon34" id="polygon34"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="27867,10851 27197,10183 27107,9610 27680,9700 27680,9700 28348,10368 28445,10945 " points="27107,9610 27680,9700 27680,9700 28348,10368 28445,10945 27867,10851 27197,10183 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon36" id="polygon36"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="29112,11609 28445,10945 28348,10368 28923,10460 28923,10460 29595,11130 29686,11706 " points="28348,10368 28923,10460 28923,10460 29595,11130 29686,11706 29112,11609 28445,10945 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon38" id="polygon38"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="30076,10645 29408,9980 29318,9400 29891,9492 29891,9492 30559,10164 30654,10735 " points="29318,9400 29891,9492 29891,9492 30559,10164 30654,10735 30076,10645 29408,9980 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon40" id="polygon40"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="30559,10164 29891,9492 29803,8919 30376,9009 30376,9009 31044,9679 31137,10253 " points="29803,8919 30376,9009 30376,9009 31044,9679 31137,10253 30559,10164 29891,9492 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon42" id="polygon42"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="31321,11405 30654,10735 30559,10164 31137,10253 31137,10253 31806,10922 31895,11500 " points="30559,10164 31137,10253 31137,10253 31806,10922 31895,11500 31321,11405 30654,10735 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon44" id="polygon44"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="30836,11890 30171,11220 30076,10645 30654,10735 30654,10735 31321,11405 31414,11981 " points="30076,10645 30654,10735 30654,10735 31321,11405 31414,11981 30836,11890 30171,11220 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon46" id="polygon46"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="29595,11130 28923,10460 28833,9883 29408,9980 29408,9980 30076,10645 30171,11220 " points="28833,9883 29408,9980 29408,9980 30076,10645 30171,11220 29595,11130 28923,10460 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<polygon <polygon
id="polygon48" id="polygon48"
style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter" style="fill:#ffe0e0;stroke:#ff0000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter"
points="30353,12373 29686,11706 29595,11130 30171,11220 30171,11220 30836,11890 30931,12466 " points="29595,11130 30171,11220 30171,11220 30836,11890 30931,12466 30353,12373 29686,11706 "
transform="matrix(0.90647511,0,0,0.90647511,2687.9989,870.71675)" /> transform="matrix(0.88407596,0.20026688,-0.20026688,0.88407596,8037.0631,-4399.335)" />
<path
inkscape:connector-curvature="0"
id="path3102"
d="m 17406.182,17943.438 a 1137.2016,1137.2016 0 0 0 856.574,1485.415"
style="color:#000000;fill:none;stroke:#000000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker:none;visibility:visible;display:inline;overflow:visible;enable-background:accumulate" />
<polyline
id="polyline3116"
style="color:#000000;fill:none;stroke:#000000;stroke-width:64.65872192;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker:none;visibility:visible;display:inline;overflow:visible;enable-background:accumulate"
points="1800,6300 2100,5400 2400,6300 "
transform="matrix(1.4692527,0,0,1.4692527,14411.845,10946.857)" />
<path
inkscape:connector-curvature="0"
id="path3102-0"
d="m 20305.836,17943.438 a 1137.2016,1137.2016 0 0 0 856.574,1485.415"
style="color:#000000;fill:none;stroke:#000000;stroke-width:95;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker:none;visibility:visible;display:inline;overflow:visible;enable-background:accumulate" />
<polyline
id="polyline3116-2"
style="color:#000000;fill:none;stroke:#000000;stroke-width:64.65872192;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker:none;visibility:visible;display:inline;overflow:visible;enable-background:accumulate"
points="1800,6300 2100,5400 2400,6300 "
transform="matrix(1.4692527,0,0,1.4692527,17311.499,10946.857)" />
</svg> </svg>