diff --git a/doc/papers/2011/europar/coherent-dedispersion.jgr b/doc/papers/2011/europar/coherent-dedispersion.jgr index 715210ea1712d579d5a4e1fdb5b9e0b2219789e7..1e0baa8cc59b412248d1bc9ea40e5b97848f7d70 100644 --- a/doc/papers/2011/europar/coherent-dedispersion.jgr +++ b/doc/papers/2011/europar/coherent-dedispersion.jgr @@ -1,25 +1,34 @@ newgraph xaxis min 0 max 2 - label : Rotational phase no_auto_hash_labels + (* + label : Rotational phase + hash_label at 0 : 0 hash_label at 1 : 2p hash_label at 2 : 4p hash_labels font Symbol + *) + + label : Time (ms) + + hash_label at 0 : 0 + hash_label at 1 : 1.88 + hash_label at 2 : 3.76 yaxis - min 0 max 1.5 + min 0 max 2 nodraw legend x 0.5 y 1.2 newline - label : Subband-level dedispersion + label : Without dedispersion linetype dashed linethickness 2.0 color 0 0 1 @@ -36,7 +45,7 @@ newline do echo $N / 23 | bc -l; echo "($i + 0.13795437)/(0.83333333 + 0.13795437)" | bc -l; N=$[$N+1]; done newline - label : Channel-level dedispersion + label : With dedispersion linetype solid linethickness 2.0 color 1 0 0 diff --git a/doc/papers/2011/europar/lofar.pdf b/doc/papers/2011/europar/lofar.pdf index 2e6d76a5006d7329d9f1a8994bbd791676a24f0e..c420ccccca4e28a5a3fa70027826dceb0b549a42 100644 Binary files a/doc/papers/2011/europar/lofar.pdf and b/doc/papers/2011/europar/lofar.pdf differ diff --git a/doc/papers/2011/europar/lofar.tex b/doc/papers/2011/europar/lofar.tex index cba47ff136fa65c7d3082a3aebc0e8a3faf5ebb3..1eb3614b58705318ef04cdcf35ee225c6f14090c 100644 --- a/doc/papers/2011/europar/lofar.tex +++ b/doc/papers/2011/europar/lofar.tex @@ -124,9 +124,9 @@ We use an IBM BlueGene/P (BG/P) supercomputer for the real-time processing of st Our system consists of 3 racks, with 12,480 processor cores that provide 42.4 TFLOPS peak processing power. One chip contains four PowerPC~450 cores, running at a modest 850~Mhz clock speed to reduce power consumption and to increase package density. Each core has two floating-point units (FPUs) that provide support for operations on complex numbers. The chips are organised in \emph{psets}, each of which consists of 64 cores for computation (\emph{compute cores}) and one chip for communication (\emph{I/O node}). Each compute core runs a fast, simple, single-process kernel (the Compute Node Kernel, or CNK), and has access to 512 MiB of memory. The I/O nodes consist of the same hardware as the compute nodes, but additionally have a 10~Gb/s Ethernet interface connected. Also, they run Linux, which allows the I/O nodes to do full multitasking. One rack contains 64 psets, which is equal to 4096 compute cores and 64 I/O nodes. -The BG/P contains several networks. A fast \emph{3-dimensional torus\/} connects all compute nodes and is used for point-to-point and all-to-all communications. The torus uses DMA to offload the CPUs and allows asynchronous communication. The \emph{collective network\/} is used for MPI collective operations, but also for external communication. External communication is routed within each pset, through its I/O node using a tree configuration. The I/O node is capable of transporting 6.8~Gb/s bidirectionally to and from one or more compute nodes. In both networks, data is routed through compute nodes using a shortest path. Additional networks exist for fast barriers, initialization, diagnostics, and debugging. % TODO: add cross-section/link bandwidth for 3D torus? +The BG/P contains several networks. 
A fast \emph{3-dimensional torus\/} connects all compute nodes and is used for point-to-point and all-to-all communications over 3.4~Gb/s links. The torus uses DMA to offload the CPUs and allows asynchronous communication. The \emph{collective network\/} is used for communication within a pset between an I/O node and the compute nodes, using 6.8~Gb/s links. In both networks, data is routed through compute nodes using a shortest path. Additional networks exist for fast barriers, initialization, diagnostics, and debugging.

-\subsection{External Networks}
+\subsection{External I/O}
 \label{Sec:Networks}

 \begin{figure}[ht]
@@ -144,10 +144,7 @@
 \end{minipage}
 \end{figure}

-
-We run a multi-threaded program on each I/O~node which is responsible for two tasks: the handling of input, and the handling of output (see Figure \ref{fig:ion-processing}). Even though the I/O nodes each have a 10~Gb/s Ethernet interface, they do not have enough computation power to handle 10~Gb/s of data. The overhead of handling IRQs, IP, and UDP/TCP put such a high load on the 850~MHz cores of the I/O nodes, that the performance seems limited to a total data rate of roughly 4~Gb/s. To achieve these data rates, we installed our own software on the I/O nodes, augmenting IBM's software stack~\cite{Yoshii:10}, and we implemented a low-overhead communication protocol called FCNP~\cite{Romein:09a} to efficiently transport data to and from the compute nodes. Recall that at full observational bandwidth, a station produces 3.1~Gb/s of data. Each I/O node can thus receive data from at most one station. The I/O nodes forward the station data to the compute nodes. The compute nodes convert the (complex) integer samples to the (complex) float domain, and perform all of the necessary on-line signal processing.
-
-Once the compute nodes have finished processing the data, the results are sent back to the I/O nodes. The I/O nodes forward these results to our 24-node storage cluster. Each I/O node can send up to 1.1~Gb/s if it receives station data at the same time, and up to 3.1~Gb/s if the I/O node is not receiving station data. The storage cluster itself can handle up to 80~Gb/s of sustained throughput. The output queue which is maintained at the I/O node for all data uses a best-effort policy and drops data if it cannot be sent, in order to keep the BG/P running at real time.
+We customised the I/O node software stack~\cite{Yoshii:10} and run a multi-threaded program on each I/O~node that is responsible for two tasks: handling input and handling output (see Figure \ref{fig:ion-processing}). Even though the I/O nodes each have a 10~Gb/s Ethernet interface, they do not have enough computation power to handle 10~Gb/s of data. The overhead of handling IRQs, IP, and UDP/TCP puts a high load on the 850~MHz cores of the I/O nodes, limiting performance. An I/O node can output at most 3.1~Gb/s, unless it also has to handle station input (3.1~Gb/s), in which case it can output at most 1.1~Gb/s. We implemented a low-overhead communication protocol called FCNP~\cite{Romein:09a} to efficiently transport data to and from the compute nodes. Each I/O node forwards its data to the compute nodes, which perform all of the necessary processing. Once the compute nodes have finished processing the data, the results are sent back to the I/O nodes. The I/O nodes forward these results to our storage cluster, which can sustain a throughput of up to 80~Gb/s. The I/O node drops output data if it cannot be sent, in order to keep the system running in real time.
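+The best-effort policy can be illustrated with a bounded output queue that never blocks the real-time path. The sketch below is ours: the names and sizes are invented, and the actual I/O-node code is multi-threaded and more involved.
+\begin{verbatim}
+#include <stdbool.h>
+#include <string.h>
+
+#define QUEUE_SLOTS 8          /* bounded: I/O-node memory is scarce  */
+#define BLOCK_BYTES (1 << 20)  /* illustrative block size             */
+
+struct output_queue {
+    char   data[QUEUE_SLOTS][BLOCK_BYTES];
+    size_t head, tail, count;  /* guarded by a mutex in threaded code */
+};
+
+/* Producer side: never blocks; a full queue means the block is lost,
+ * so the pipeline keeps running in real time. */
+bool enqueue_best_effort(struct output_queue *q, const char *block)
+{
+    if (q->count == QUEUE_SLOTS)
+        return false;                      /* queue full: drop block  */
+    memcpy(q->data[q->tail], block, BLOCK_BYTES);
+    q->tail = (q->tail + 1) % QUEUE_SLOTS;
+    q->count++;
+    return true;
+}
+
+/* Consumer side: the thread that writes to the storage cluster. */
+bool dequeue(struct output_queue *q, char *block)
+{
+    if (q->count == 0)
+        return false;
+    memcpy(block, q->data[q->head], BLOCK_BYTES);
+    q->head = (q->head + 1) % QUEUE_SLOTS;
+    q->count--;
+    return true;
+}
+\end{verbatim}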
\comment{
BG/P explanation:
@@ -160,44 +157,33 @@ Once the compute nodes have finished processing the data, the results are sent b
 \section{Beamforming}
 \label{Sec:Beamforming}

-As mentioned in Section \ref{Sec:LOFAR}, a LOFAR station is aimed at a source by applying different delays to the signals from the antennas, and subsequently adding the delayed signals. Delaying the signals is known as \emph{delay compensation}, and subsequently adding them as \emph{beamforming}, which is performed at several stages in our processing pipeline. The station beamformer is implemented in hardware, in which the signal is delayed by switching it over wires of different lengths. The signals from the different antennas are subsequently added using an FPGA. The resulting \emph{station beam}, even though it is aimed at a certain source, is nevertheless still sensitive to signals from a large area around the source.
-
-The BG/P, in which the signals from all the LOFAR stations come together, again performs delay compensation by adding the appropriate delays to the signals from the various stations, this time in software. Because the beam produced by each station has a large sensitive area around the source, the BG/P beamformer can point at the source at which the station beams are pointing, but also at sources around it. An example is shown in Figure \ref{fig:pencilbeams}. The station beam, represented by an ellipse, is sensitive to a sky region around the source it is pointed at. The BG/P can process the data from the stations such that the focus is shifted to sources within the sensitive region, creating \emph{tied-array beams}, which are represented in the figure by hexagons. The actual width of the station beam, as well as the width of the tied-array beams, depends on the number as well as the locations of the stations used. Hundreds of tied-array beams are typically needed to fully cover the sensitive region of a station beam.
-
-Different tied-array beams are formed by adding the signals from the individual stations over and over again, each time using different delays. The delay that has to be applied depends on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in software in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. Even though a phase correction is not a proper shift because information does not shift between one sample and the next, the difference proves to be sufficiently small in practice.
+A station focusses on a source by applying different delays to the signals from its antennas, and subsequently adding the delayed signals. Delaying the signals is known as \emph{delay compensation}, and subsequently adding them as \emph{beam forming}. The station beam former is implemented in hardware, in which the signal is delayed by switching it over wires of different lengths. The signals from the different antennas are subsequently added using an FPGA. The resulting \emph{station beam}, even though it is focussed on a certain source, still has a wide field of view.
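+To make the delay-and-sum principle concrete, the sketch below delays each antenna's sample stream by a whole number of samples and accumulates the result. It is an illustration only: the names are ours, the station performs this step in hardware, and sub-sample corrections are ignored here.
+\begin{verbatim}
+#include <complex.h>
+#include <stddef.h>
+
+/* Illustrative delay-and-sum beam former. */
+void delay_and_sum(size_t n_antennas, size_t n_samples,
+                   const float complex *signal[], /* signal[a][t]    */
+                   const size_t delay[],          /* whole samples   */
+                   float complex beam[])          /* beam[t]         */
+{
+    for (size_t t = 0; t < n_samples; t++) {
+        float complex sum = 0;
+        for (size_t a = 0; a < n_antennas; a++)
+            sum += signal[a][t + delay[a]]; /* caller pads streams  */
+        beam[t] = sum;
+    }
+}
+\end{verbatim}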
-This approximation is in fact good enough to limit the generation of different tied-array beams through phase corrections alone. The samples coming from different stations are shifted the same amount for all beams that are created. Because a phase shift can be performed using a complex multiplication, the beam former in the BG/P only has to accumulate the weighted vectors of samples from each station. Let $\overrightarrow{S_i}$ be the stream of samples from station $i$, $\overrightarrow{B_j}$ the stream of samples of beam $j$, and $w_{ij}$ the phase correction to be applied on station $i$ to represent the required delay for beam $j$. Then, the beam former calculates $\overrightarrow{B_j} = \sum_{i \in \textrm{stations}}w_{ij}\overrightarrow{S_i}$ on the samples of both the X and Y polarisations independently, to obtain beam $j$. The beam former can be easily parallised. As mentioned in Section \ref{Sec:LOFAR}, the station data consist of a sample stream consisting of up to 248 subbands. The subbands are independent and can thus be processed in parallel. Also, the streams can quite trivially be split along the time dimension in order to increase parallelisation.
+The BG/P, which receives the signals from all stations, again performs delay compensation and beam forming, this time in software. The BG/P beam former can focus on sources anywhere in the fields of view of the station beams, creating \emph{tied-array beams} (\emph{beams} for short). An example is shown in Figure \ref{fig:pencilbeams}, in which a station beam (represented by an ellipse) contains several tied-array beams (represented by hexagons). The actual widths of the station beam and of the tied-array beams depend on the number and the locations of the stations used. Hundreds of tied-array beams are typically needed to fully cover the field of view of a station beam.

-A beam $\overrightarrow{B_j}$ formed at the BG/P consists of a stream of complex 32-bit floating point numbers, two for each time sample (representing the X and Y polarisations), which is equal to 6.2~Gb/s at LOFAR's full observational bandwidth. For some observations however, such a precision is not required, and the beams can be reduced in size in order to be able to output more beams in parallel. In this paper, we consider two types of observations. First, we consider \emph{high-resolution} observations, in which an astronomer wants to look at sources at the highest resolution possible. Such observations will typically produce 6.2~Gb/s per beam. The second type of observation is a \emph{many-beams} observations, in which an astronomer wants to survey the sky with as many beams as possible, given a lower bound on the acceptable resolution. Because each individual beam can be recorded with a much lower resolution than in the high-resolution observations, bandwidth becomes available to create more beams.
+Different tied-array beams are created by adding the signals from the individual stations using different delays. The delays that have to be applied to obtain a tied-array beam depend on the relative positions of the stations and the relative direction of the tied-array beam with respect to the station beam. The delays are applied in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal.
In order to obtain different tied-array beams, only the sub-sample delays have to be adjusted. A phase shift is performed by applying a complex multiplication. To form a beam, the beam former gathers the streams of samples from the stations, multiplies them with precomputed weights representing the required phase shift, and adds the streams together. The same weights are applied to both the X and the Y polarisations. The resulting data stream is called the \emph{XY polarisations}, and consists of 32-bit complex floating point numbers.

-The two types of observations translate into three modes in which our pipeline can run:
-\begin{description}
-\item[Complex Voltages] are the untransformed beams as produced by the beamformer. For each beam, the complex 32-bit float samples for the X and Y polarisations are split and stored in two separate files in disk, resulting in two 3.1~Gb/s streams to disk per beam.
-\item[Stokes IQUV] parameters represent the polarisation aspects of the signal, and are the result of a domain transformation performed on the complex voltages: $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$. The transformation is useful for polarisation-related studies. Each Stokes parameter consists of one 32-bit float per time sample, and is stores in a seperate file. The Stokes IQUV mode thus results in four 1.5~Gb/s streams to disk per beam. % TODO: use of Stokes IQUV
-\item[Stokes I] represents the power of the signal in the X and Y polarisations combined, and is equal to the Stokes I stream in the Stokes IQUV mode. This mode thus results in one 1.5~Gb/s stream to disk per beam. Integrating the samples over time is supported in this mode. Time integration reduces the bandwidth per beam by an integer factor, but reduces the time resolution as well.
-\end{description}
+The XY polarisations can optionally be converted into \emph{Stokes IQUV} parameters, which represent the polarisation aspects in an alternative way. The Stokes parameters are defined as $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit real floating point number.

-%TODO: Incoherent stokes

+Both the XY polarisations and the Stokes IQUV parameters require up to 6.2~Gb/s per beam, which severely limits the number of beams that can be produced, and which provides a time resolution that is not always necessary. For example, in sky surveys, it is desirable to create many beams. The data rate per beam thus has to be lowered. For many-beam observations, we convert the XY polarisations into just the \emph{Stokes I} parameter, which represents the power of the signal in the X and Y polarisations combined. The resulting data rate is 1.5~Gb/s per beam, but we also allow time-wise integration to further reduce the data rate by an integer factor, allowing even more beams to be created.

-%The BG/P is able to produce tens to hundreds of beams, depending on the mode used. As our measurements will show, the Complex Voltages and Stokes IQUV modes will typically hit an I/O bottleneck when transporting the created beams towards the storage cluster. In the Stokes I mode, the bandwidth per beam is lower and can be further reduced using integration. The I/O bottleneck can thus be avoided in Stokes I mode. If the number of beams is increased, an upper limit on the available memory and computational power will be reached instead.
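+The conversion of an (X, Y) sample pair into Stokes parameters follows directly from the definitions above. A minimal sketch (ours; the production code is vectorised):
+\begin{verbatim}
+#include <complex.h>
+
+struct stokes { float i, q, u, v; };
+
+/* One (X, Y) sample pair -> Stokes IQUV, per the definitions above. */
+struct stokes xy_to_stokes(float complex x, float complex y)
+{
+    float xx = crealf(x * conjf(x));          /* |X|^2           */
+    float yy = crealf(y * conjf(y));          /* |Y|^2           */
+    float complex xy = x * conjf(y);
+
+    struct stokes s = { .i = xx + yy,         /* total power     */
+                        .q = xx - yy,
+                        .u = 2.0f * crealf(xy),
+                        .v = 2.0f * cimagf(xy) };
+    return s;
+}
+\end{verbatim}
+In Stokes~I mode only the \texttt{i} field is kept, a quarter of the XY data rate ($6.2/4 \approx 1.5$~Gb/s); time integration then sums consecutive Stokes~I samples.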
+For each beam, each polarisation (X, Y) or Stokes parameter (I, Q, U, V) is transported and stored in separate files.

% TODO: more splits..
% TODO: incoherent stokes

\section{Pulsar Pipeline}

%To observe known pulsars, our beamformer is put in the high-resolution mode, in which Complex Voltages or Stokes IQUV parameters are recorded at full bandwidth in order to closely study the shapes of the individual pulses.
-In this section, we will describe in detail how the full signal-processing pipeline operates, in and around the beamformer. Much of the pipeline's operation and design is similar to our standard imaging pipeline, described in \cite{Romein:10a}.
+In this section, we will describe in detail how the full signal-processing pipeline operates, in and around the beamformer. Because it is a software pipeline, we can reconfigure and reuse the components of our standard imaging pipeline, described in \cite{Romein:10a}. In fact, both pipelines can be run simultaneously.

\subsection{Input from Stations}

-The first step in the pipeline is receiving and collecting from the stations on the I/O nodes. Each I/O node receives the data of (at most) one station, and stores the received data in a circular buffer (recall Figure \ref{fig:ion-processing}). If necessary, the read pointer of the circular buffer is shifted a number of samples to reflect the coarse-grain delay compensation that will be necessary to align the streams from different stations, based on the location of the source at which the stations are pointed.
+The first step in the pipeline is receiving and collecting the data from the stations on the I/O nodes. Each I/O node receives the data of (at most) one station, and stores the received data in a circular buffer (recall Figure \ref{fig:ion-processing}). If necessary, the read pointer of the circular buffer is shifted a number of samples to reflect the coarse-grain delay compensation that will be necessary to align the streams from different stations.

-The station data are split into chunks of one subband and approximately 0.25 seconds. Such a chunk is the unit of data on which a compute node will operate. The size of a chunk is chosen such that the compute nodes will have enough memory to perform the necessary operations on them.
-
-To perform beamforming, the compute nodes need chunks from all stations. Unfortunately, an I/O node can only communicate (efficiently) with the compute nodes in its own pset, which makes it impossible to send the chunks directly to the compute cores that will process them. Instead, the I/O node distributes its chunks over its compute cores in a round-robin fashion, after which the compute cores obtain trade different chunks from the same station for the same chunk from different stations, using an all-to-all exchange.
+The station data are split into chunks of one subband and approximately 0.25 seconds. The chunk size is chosen such that the compute cores have enough memory to perform all of the necessary processing. Due to the BG/P design, an I/O node sends chunks to its own compute cores only. The compute cores exchange the chunks they obtain from their I/O node using an all-to-all exchange.

\subsection{First All-to-all Exchange}

-The first all-to-all exchange in the pipeline allows the compute cores to distribute the chunks from a single station, and to collect all the chunks of the same frequency band produced by all of the stations.
The exchange is performed over the fast 3D-torus network, but with up to 198~Gb/s of station data to be exchanged (=64 stations producing 3.1~Gb/s), special care still has to be taken to avoid network bottlenecks. It is impossible to optimise for short network paths due to the physical distances between the different psets across a BG/P rack. Instead, we optimised the data exchange by creating as many paths as possible between compute cores that have to exchange data. Within each pset, we employ a virtual remapping such that the number of possible routes between communicating cores in different psets is maximised.
+The first all-to-all exchange in the pipeline allows the compute cores to distribute the chunks from a single station, and to collect all the chunks of the same subband from all of the stations. The exchange is performed over the fast 3D-torus network, but with up to 198~Gb/s of station data to be exchanged (64 stations producing 3.1~Gb/s), special care still has to be taken to avoid network bottlenecks. It is impossible to optimise for short network paths due to the physical distances between the different psets across a BG/P rack. Instead, we optimised the data exchange by creating as many paths as possible between compute cores that have to exchange data. Within each pset, we employ a virtual remapping such that the number of possible routes between communicating cores in different psets is maximised.

The communications in the all-to-all exchange are asynchronous, which allows a compute core to start processing a subband from a station as soon as it arrives, up to the point that data from all stations are required. Communication and computation are thus overlapped as much as possible.

@@ -217,29 +203,25 @@ Up to this point, processing chunks from different stations can be done independ

The beamformer creates the beams as described in Section \ref{Sec:Beamforming}. First, the different weights required for the different beams are computed, based on the station positions and the beam directions. Note that the data in the chunks are already delay compensated with respect to the source at which the stations are pointed. Any delay compensation performed by the beamformer is therefore to compensate the delay differences between the desired beams and the station's source. The reason for this two-stage approach is flexibility. By already compensating for the station's source in the previous step, the resulting data can not only be fed to the beamformer, but also to other pipelines, such as the imaging pipeline. Because we have a software pipeline, we can implement and connect different processing pipelines with only a small increase in complexity.

-The delays are applied to the station data through complex multiplications and additions. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. While the exact ideal set size in which the data is to be processed depends on the architecture at hand, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{FOO,BAR}.
+The delays are applied to the station data through complex multiplications and additions, programmed in assembly. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. While the exact ideal set size depends on the architecture at hand, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{FOO,BAR}.
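+The inner loop is a complex multiply-accumulate: each output beam is a weighted sum of the station streams, $\overrightarrow{B_j} = \sum_{i \in \textrm{stations}} w_{ij}\overrightarrow{S_i}$. A plain-C sketch of this loop (ours; the production version is hand-scheduled assembly that blocks over 6 stations and 3 beams to keep the corresponding weights and accumulators in registers):
+\begin{verbatim}
+#include <complex.h>
+#include <stddef.h>
+
+/* beam[j][t] = sum over i of weight[i][j] * station[i][t] */
+void beam_form(size_t n_stations, size_t n_beams, size_t n_samples,
+               const float complex *station[],  /* station[i][t] */
+               const float complex *weight[],   /* weight[i][j]  */
+               float complex *beam[])           /* beam[j][t]    */
+{
+    for (size_t j = 0; j < n_beams; j++)
+        for (size_t t = 0; t < n_samples; t++) {
+            float complex acc = 0;
+            for (size_t i = 0; i < n_stations; i++)
+                acc += weight[i][j] * station[i][t];
+            beam[j][t] = acc;
+        }
+}
+\end{verbatim}
+The same loop runs once for the X and once for the Y polarisation, with the same weights.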
-Because each beam is an accumulated combination of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they can be kept as such (Complex Voltages) or optionally be transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated together to reduce the resulting data rate.
+Because each beam is an accumulated combination of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they are kept as XY polarisations or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated together to reduce the resulting data rate.

The beamformer transforms chunks representing station data into chunks representing beam data. Because a chunk representing station data contained data for only one subband, the chunks representing different subbands of the same beam are still spread out over the full BG/P. Chunks corresponding to the same beam are brought together using a second all-to-all exchange.

-\subsection{Dedispersion}
-
-\subsection{Channel-level dedispersion}
+\subsection{Channel-level Dedispersion}

-Another major component in the pulsar-observation pipeline is real-time dedispersion. Since light of a high frequency travels faster through the interstellar medium than light of a lower frequency, the arrival time of a pulse differs for different wave lengths. To combine data from multiple frequency channels, the channels must be aligned (i.e., shifted in time), otherwise, the pulse will be smeared and becomes invisible. This process, called \emph{dedispersion}, is done by post-processing software that runs after the observation has finished. However, to observe at the lowest frequencies, or to observe fast-rotating millisecond pulsars, dedispersion must also be performed \emph{within\/} a channel, since our channels (typically 12~KHz) are too wide to ignore dispersion. Channel-level dedispersion can only be done before integration (Section ...), thus in the real-time pipeline.
+Another major component in the pulsar-observation pipeline is real-time dedispersion. Since light of a high frequency travels faster through the interstellar medium than light of a lower frequency, the arrival time of a pulse differs for different wavelengths. To combine data from multiple frequency channels, the channels must be aligned (shifted in time). Otherwise, the pulse will be smeared, causing any details to be lost. This process, called \emph{dedispersion}, is done by post-processing software that runs after the observation has finished. However, to observe at the lowest frequencies, or to observe fast-rotating millisecond pulsars, dedispersion must also be performed \emph{within\/} a channel, since our channels (typically 12~kHz) are too wide to ignore dispersion. Because channel-level dedispersion must be done before any integration, it has to run in the real-time pipeline.
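+The chirp weights can be sketched as follows. This is a simplified illustration of ours, not the production code: it derives a per-subchannel phase rotation from the standard cold-plasma delay, leaves the FFTs to a library, and glosses over details such as FFT bin ordering.
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+#define FFT_SIZE 4096        /* 12 kHz / 4096 ~ 3 Hz subchannels        */
+#define K_DM     4.148808e3  /* dispersion constant, s MHz^2 pc^-1 cm^3 */
+
+/* Precomputed once per channel and dispersion measure (dm):
+ * one complex weight per subchannel. */
+void compute_chirp(double f0_mhz,      /* channel centre frequency */
+                   double width_mhz,   /* channel width (0.012)    */
+                   double dm,
+                   float complex weights[FFT_SIZE])
+{
+    for (int s = 0; s < FFT_SIZE; s++) {
+        double f = f0_mhz + (s - FFT_SIZE / 2) * width_mhz / FFT_SIZE;
+        double tau = K_DM * dm * (1.0 / (f * f) - 1.0 / (f0_mhz * f0_mhz));
+        double phase = -2.0 * M_PI * (f - f0_mhz) * 1e6 * tau;
+        weights[s] = cexpf(I * phase);
+    }
+}
+
+/* Applied to every block, between the forward and the backward FFT. */
+void apply_chirp(float complex subchannels[FFT_SIZE],
+                 const float complex weights[FFT_SIZE])
+{
+    for (int s = 0; s < FFT_SIZE; s++)
+        subchannels[s] *= weights[s];
+}
+\end{verbatim}
+The weights are computed once; per block, the real-time loop performs a forward FFT, the multiplication above (hand-coded in assembly in our pipeline), and a backward FFT.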
-Dedispersion is performed in the frequency domain, effectively by doing a 4K~Fourier transform that splits a 12~KHz channel into 3~Hz subchannels. The phases of the observed samples are corrected by applying a Chirp function~\cite{...}, i.e., by multiplication with precomputed, subchannel-dependent, complex weights. These multiplications are programmed in assembly, to reduce the computational costs. A backward FFT is done to revert to 12~KHz channels.
+Dedispersion is performed in the frequency domain, effectively by doing a 4K~Fourier transform that splits a 12~kHz channel into 3~Hz subchannels. The phases of the observed samples are corrected by applying a chirp function~\cite{...}, i.e., by multiplication with precomputed, subchannel-dependent, complex weights. These multiplications are programmed in assembly, to reduce the computational costs. A backward FFT is done to revert to 12~kHz channels.

\begin{figure}[ht]
-\includegraphics[width=\textwidth]{coherent-dedispersion.pdf}
+\includegraphics[width=0.5\textwidth]{coherent-dedispersion.pdf}
\label{fig:dedispersion-result}
-\caption{Pulse profiles of pulsar J0034-0534, with dedispersion applied at channel and at subband level.}
+\caption{Pulse profiles of pulsar J0034-0534, with and without dedispersion.}
\end{figure}

-Figure~\ref{fig:dedispersion-result} shows the effectiveness of channel-level dedispersion, where we observed pulsar J0034-0534 with a pulse period of 1.88~ms with and without dedispersion. With dedispersion, the pulse is narrower and has the correct shape, while the noise floor is lower.
-
-Dedispersion is done before or after beam forming, depending on the number of stations and the number of beams, wherever the computational costs are minimal. Still, the computational costs are high (see ...). Yet, this pipeline component significantly contributes to the data quality of the LOFAR telescope. It also demonstrates the power of using a \emph{software\/} telescope; the pipeline component was implemented, verified, and optimized in only one month time.
+Figure~\ref{fig:dedispersion-result} shows the effectiveness of channel-level dedispersion: we observed pulsar J0034-0534, which has a pulse period of 1.88~ms, with and without dedispersion. By applying dedispersion, the effective time resolution improves from 0.51~ms to 0.082~ms, revealing a narrower, more detailed pulse above a lower noise floor. Dedispersion thus contributes significantly to the data quality of the LOFAR telescope, but it also comes at a significant computational cost due to the two FFTs it requires. It also demonstrates the power of using a \emph{software\/} telescope: the pipeline component was implemented, verified, and optimised in only one month's time.

\subsection{Second All-to-all Exchange}