diff --git a/doc/papers/2011/europar/lofar.pdf b/doc/papers/2011/europar/lofar.pdf
index 50153b50d5f8f2d901550eec28522a392a0e3606..944036382093f3a5f120e4e632cc87a41a824773 100644
Binary files a/doc/papers/2011/europar/lofar.pdf and b/doc/papers/2011/europar/lofar.pdf differ
diff --git a/doc/papers/2011/europar/lofar.tex b/doc/papers/2011/europar/lofar.tex
index 51622edb70455e3453c6154b9450c2902282b7f0..956071e07c65f0f135d8d1f051bda0139ddd5a2b 100644
--- a/doc/papers/2011/europar/lofar.tex
+++ b/doc/papers/2011/europar/lofar.tex
@@ -59,9 +59,9 @@ Another novelty is the elaborate use of software to process the telescope data i
 
 For processing LOFAR data, we use an IBM BlueGene/P (BG/P) supercomputer. The LOFAR antennas are grouped into stations, and each station sends its data (up to 200 Gb/s for all stations) to the BG/P super computer. Inside the BG/P, the data are split and recombined using both real-time signal processing routines as well as two all-to-all exchanges. The output data streams are sufficiently reduced in size in order to be able to stream them out of the BG/P and store them on disks in our storage cluster.
 
-The stations can be configured to observe in several directions in parallel, but have to divide their output bandwidth among them. In this paper, we present the \emph{pulsar pipeline}, an extension to the LOFAR software which allows the telescope to be aimed in tens of directions simultaneously at LOFAR's full observational bandwidth, and in hundreds of directions at reduced bandwidth. Both feats cannot be matched by any other telescope. The data streams corresponding to each observational direction, called \emph{beams}, are generated through (weighted) summations of the station inputs, which are demultiplexed using an all-to-all exchange, and routed to the storage cluster.
+The stations can be configured to observe in several directions in parallel, but have to divide their output bandwidth among them. In this paper, we present the \emph{beamformer}, an extension to the LOFAR software which allows the telescope to be aimed in tens of directions simultaneously at LOFAR's full observational bandwidth, and in hundreds of directions at reduced bandwidth. Neither feat can be matched by any other telescope. The data streams corresponding to each observational direction, called \emph{beams}, are generated through (weighted) summations of the station inputs, which are demultiplexed using an all-to-all exchange, and routed to the storage cluster.
 
-Using the pulsar pipeline, astronomers can focus on known pulsars, planets, exoplanets, the sun, and other objects, with unprecedented sensitivity. Furthermore, our pipeline allows fast broad-sky surveys to discover new pulsars, by observing in hundreds of directions in parallel. 
+The primary scientific use case driving the work presented in this paper is pulsar research. A pulsar is a rapidly rotating, highly magnetised neutron star, which emits electromagnetic radiation from its poles. Similar to the behaviour of a lighthouse, the radiation is visible to us only if one of the poles points towards Earth, and consequently appears to us as a very regular series of pulses, with a period as low as 1.4~ms~\cite{Hessels:06}. Pulsars are relatively weak radio sources, and their individual pulses often do not rise above the background noise that fills our universe. LOFAR is one of the few telescopes that operate in the frequency range (10 -- 240 MHz) in which pulsars are typically at their brightest. Our beamformer also makes LOFAR the only telescope that can observe in hundreds of directions simultaneously with high sensitivity. These aspects make LOFAR an ideal instrument to discover unknown pulsars by performing a sensitive sky survey in a short amount of time, as well as to study known pulsars in more detail. Astronomers can also use our beamformer to focus on planets, exoplanets, the sun, and other radio objects, with unprecedented sensitivity. Furthermore, our pipeline allows fast broad-sky surveys to discover not only new pulsars but also other radio sources.
 
 % TODO: mention other uses / observation types
 
@@ -119,9 +119,10 @@ We use an IBM BlueGene/P (BG/P) supercomputer for the real-time processing of st
 
 Our system consists of 3 racks, with 12,480 processor cores that provide 42.4 TFLOPS peak processing power. One chip contains four PowerPC~450 cores, running at a modest 850~Mhz clock speed to reduce power consumption and to increase package density. Each core has two floating-point units (FPUs) that provide support for operations on complex numbers. The chips are organised in \emph{psets}, each of which consists of 64 cores for computation (\emph{compute cores}) and one chip for communication (\emph{I/O node}). Each compute core runs a fast, simple, single-process kernel (the Compute Node Kernel, or CNK), and has access to 512 MiB of memory. The I/O nodes consist of the same hardware as the compute nodes, but additionally have a 10~Gb/s Ethernet interface connected. Also, they run Linux, which allows the I/O nodes to do full multitasking. One rack contains 64 psets, which is equal to 4096 compute cores and 64 I/O nodes.
 
-The BG/P contains several networks. A fast \emph{3-dimensional torus\/} connects all compute nodes and is used for point-to-point and all-to-all communications. The torus uses DMA to offload the CPUs and allows asynchronous communication. The \emph{collective network\/} is used for MPI collective operations, but also for external communication. External communication is routed within each pset, through its I/O node using a tree configuration. The I/O node is capable of transporting 6.8~Gb/s bidirectionally to and from one or more compute nodes. Additional networks exist for fast barriers, initialization, diagnostics, and debugging. % TODO: add cross-section/link bandwidth for 3D torus?
+The BG/P contains several networks. A fast \emph{3-dimensional torus\/} connects all compute nodes and is used for point-to-point and all-to-all communications. The torus uses DMA to offload the CPUs and allows asynchronous communication. The \emph{collective network\/} is used for MPI collective operations, but also for external communication. External communication is routed within each pset, through its I/O node using a tree configuration. The I/O node is capable of transporting 6.8~Gb/s bidirectionally to and from one or more compute nodes. In both networks, data are routed through intermediate compute nodes along a shortest path. Additional networks exist for fast barriers, initialization, diagnostics, and debugging. % TODO: add cross-section/link bandwidth for 3D torus?
 
-\subsection{External Connections}
+\subsection{External Networks}
+\label{Sec:Networks}
 
 \begin{figure}[ht]
 \includegraphics[width=\textwidth]{ION-processing.pdf}
@@ -144,22 +145,24 @@ Once the compute nodes have finished processing the data, the results are sent b
 \section{Beamforming}
 \label{Sec:Beamforming}
 
-As mentioned in Section \ref{Sec:LOFAR}, a LOFAR station is aimed at a source by applying different delays to the signals from the antennas, and subsequently adding the delayed signals. This process is known as \emph{beamforming}, which is performed at several stages in our processing pipeline. The station beamformer is implemented in hardware, in which the signal is delayed by switching it over wires of different lengths. The signals from the different antennas are subsequently added using an FPGA. The resulting beam, even though it is aimed at a certain source, is nevertheless still sensitive to signals from a large area around the source.
+As mentioned in Section \ref{Sec:LOFAR}, a LOFAR station is aimed at a source by applying different delays to the signals from the antennas, and subsequently adding the delayed signals. Delaying the signals is known as \emph{delay compensation}, and subsequently adding them as \emph{beamforming}, which is performed at several stages in our processing pipeline. The station beamformer is implemented in hardware, in which the signal is delayed by switching it over wires of different lengths. The signals from the different antennas are subsequently added using an FPGA. The resulting beam, even though it is aimed at a certain source, is nevertheless still sensitive to signals from a large area around the source.
 
-The BG/P, in which the signals from all the LOFAR stations come together, again performs beamforming by adding the appropriate delays to the signals from the various stations, this time in software. Because the beam produced by each station has a large sensitive area around the source, the BG/P beamformer can not only aim at the source for which the stations are configured, but also at sources around it. Different beams can then be created by adding the signals from the individual stations over and over again, each time using different delays. The delay that has to be applied depends on the relative positions of the stations and the relative direction of the beam with respect to the source. The delays are applied in software in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. Even though a phase correction is not a proper shift because information does not shift between one sample and the next, it proves to be a good enough approximation.
+% TODO: add pic
+The BG/P, in which the signals from all the LOFAR stations come together, again performs delay compensation by adding the appropriate delays to the signals from the various stations, this time in software. Because the beam produced by each station has a large sensitive area around the source, the BG/P beamformer can not only point at the source for which the stations are configured, but also at sources around it. Different beams can then be formed by adding the signals from the individual stations over and over again, each time using different delays. The delay that has to be applied depends on the relative positions of the stations and the relative direction of the beam with respect to the source. The delays are applied in software in two phases. First, the streams are aligned by shifting them a whole number of samples with respect to each other, which resolves delay differences up to the granularity of a single sample. Then, the remaining sub-sample delays are compensated for by shifting the phase of the signal. Even though a phase correction is not a proper shift because information does not shift between one sample and the next, the difference proves to be sufficiently small in practice.
 
 % TODO: pencil beams fit within station beam, etc
 This approximation is in fact good enough to limit the generation of different beams through phase corrections alone. The samples coming from different stations are shifted the same amount for all beams that are created. Because a phase shift can be performed using a complex multiplication, the beam former in the BG/P only has to accumulate the weighted vectors of samples from each station. Let $\overrightarrow{S_i}$ be the stream of samples from station $i$, $\overrightarrow{B_j}$ the stream of samples of beam $j$, and $w_{ij}$ the phase correction to be applied on station $i$ to represent the required delay for beam $j$. Then, the beam former performs the following calculation on the samples of both the X and Y polarisations independently, to obtain beam $j$:
 \begin{eqnarray}
 \overrightarrow{B_j} & = & \sum_{i \in \textrm{stations}}w_{ij}\overrightarrow{S_i}.
 \end{eqnarray}
+Note that this equation is easy to parallelise. As mentioned in Section \ref{Sec:LOFAR}, the station data consist of a sample stream covering up to 248 subbands. The subbands are independent and can thus be processed in parallel. The streams can also be split trivially along the time dimension to increase parallelism.
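The weighted accumulation above can be sketched as follows; this is an illustrative sketch of the equation for one polarisation, with array names and shapes of our own choosing, not the production pipeline code.

```python
import numpy as np

def beamform(samples, weights):
    """Form beams as B_j = sum_i w_ij * S_i (one polarisation).

    samples: (n_stations, n_samples) complex station streams S_i
    weights: (n_stations, n_beams) complex phase corrections w_ij
    returns: (n_beams, n_samples) complex beam streams B_j
    """
    # the sum over stations is a matrix product over the station axis
    return weights.T @ samples

# toy example: two stations with identical signals and in-phase unit
# weights, so each beam sample is the coherent sum of both stations
S = np.ones((2, 4), dtype=complex)
W = np.ones((2, 1), dtype=complex)
B = beamform(S, W)
```

Splitting the work over subbands or over the time (sample) axis only partitions the `samples` array; the weights are unchanged.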
 
-A beam $\overrightarrow{B_j}$ formed at the BG/P consists of a stream of complex 32-bit floating point numbers, two for each time sample (representing the X and Y polarisations), which is equal to 6.2~Gb/s at LOFAR's full observational bandwidth. For some observations however, such a precision is not required, and the beams can be reduced in size in order to be able to output more beams in parallel. In this paper, we consider two types of observations. First, we consider \emph{high-resolution} observations, in which an astronomer wants to look at sources at the highest resolution possible. Such observations will typically produce 6.2~Gb/s per beam. The second type of observation is a \emph{many beams} observations, in which an astronomer wants to survey the sky with as many beams as possible, given a lower bound on the acceptable resolution. Because each individual beam can be recorded with a much lower resolution than in the high-resolution observations, bandwidth becomes available to create more beams.
+A beam $\overrightarrow{B_j}$ formed at the BG/P consists of a stream of complex 32-bit floating point numbers, two for each time sample (representing the X and Y polarisations), which is equal to 6.2~Gb/s at LOFAR's full observational bandwidth. For some observations however, such a precision is not required, and the beams can be reduced in size in order to be able to output more beams in parallel. In this paper, we consider two types of observations. First, we consider \emph{high-resolution} observations, in which an astronomer wants to look at sources at the highest resolution possible. Such observations will typically produce 6.2~Gb/s per beam. The second type of observation is a \emph{many-beams} observation, in which an astronomer wants to survey the sky with as many beams as possible, given a lower bound on the acceptable resolution. Because each individual beam can be recorded with a much lower resolution than in the high-resolution observations, bandwidth becomes available to create more beams.
 
-The two types of observations are translated into three modes in which our pipeline can run:
+The two types of observations translate into three modes in which our pipeline can run:
 \begin{description}
 \item{Complex Voltages} are the untransformed beams as produced by the beamformer. For each beam, the complex 32-bit float samples for the X and Y polarisations are split and stored in two separate files on disk, resulting in two 3.1~Gb/s streams to disk per beam.
-\item{Stokes IQUV} parameters represent the polarisation aspects of the signal, and are the result of a domain transformation performed on the complex voltages, and are useful for polarisation-related studies. The Stokes parameters consists of four real 32-bit float samples which represent the Stokes I, Q, U and V values for each time sample, which are computed from the complex X and Y polarisations using the following formulas:
+\item{Stokes IQUV} parameters represent the polarisation aspects of the signal, and are the result of a domain transformation performed on the complex voltages. The transformation is useful for polarisation-related studies. The Stokes parameters consist of four real 32-bit float samples representing the Stokes I, Q, U and V values for each time sample, which are computed from the complex X and Y polarisations using the following formulas:
 \begin{eqnarray}
 I & = & X\overline{X} + Y\overline{Y}, \\
 Q & = & X\overline{X} - Y\overline{Y}, \\
@@ -175,16 +178,55 @@ The BG/P is able to produce tens to hundreds of beams, depending on the mode use
 % TODO: incoherent stokes
 \section{Pulsar Pipeline}
 
-Pulsar research is the primary scientific use case for our beamformer, and thus provides the name for our Pulsar Pipeline, which produces the desired data. We recognise two types of observation. The first type is a survey mode, in which (a portion of) the sky is scanned using many low-bandwidth beams. Interesting sources can subsequently be observed using a few high-resolution beams, which require a lot of bandwidth to record. In this section, we will describe in detail how our pipeline operates. Much of the pipeline's operation and design is similar to our standard imaging pipeline, described in \cite{Romein:10a}.
+%To observe known pulsars, our beamformer is put in the high-resolution mode, in which Complex Voltages or Stokes IQUV parameters are recorded at full bandwidth in order to closely study the shapes of the individual pulses.
+
+In this section, we describe in detail how the full signal-processing pipeline operates, both in and around the beamformer. Much of the pipeline's operation and design is similar to our standard imaging pipeline, described in \cite{Romein:10a}.
 
 \subsection{Input from Stations}
+The first step in the pipeline is receiving and collecting the station data on the I/O nodes. Each I/O node receives the data of (at most) one station, and stores the received data in a circular buffer (recall Figure \ref{fig:ion-processing}). If necessary, the read pointer of the circular buffer is shifted a number of samples to reflect the coarse-grain delay compensation that will be necessary to align the streams from different stations, based on the location of the source at which the stations are pointed.
 
-The first step in the pipeline is receiving and collecting from the stations on the I/O nodes. Each I/O node receives the data of (at most) one station, and stores the received data in a circular buffer (recall Figure \ref{fig:ion-processing}). The station data is split into chunks of one frequency subband and approximately 0.25 seconds. Such a chunk is the unit of data on which a compute node will operate. The size of a chunk is chosen such that the compute nodes will have enough memory to perform the necessary operations on them.
+The station data are split into chunks of one subband and approximately 0.25 seconds. Such a chunk is the unit of data on which a compute node will operate. The size of a chunk is chosen such that the compute nodes will have enough memory to perform the necessary operations on them.
 
-To perform beamforming, the compute nodes need chunks from all stations. Unfortunately, an I/O node can only communicate (efficiently) with the compute nodes in its own pset, which makes it impossible to send the chunks directly to the compute nodes that will process them. Instead, the I/O node distributes the chunks over its compute nodes in a round-robin fashion, after which the compute nodes obtain trade different chunks from the same station for the same chunk from different stations, using an all-to-all exchange.
+To perform beamforming, the compute nodes need chunks from all stations. Unfortunately, an I/O node can only communicate (efficiently) with the compute nodes in its own pset, which makes it impossible to send the chunks directly to the compute cores that will process them. Instead, the I/O node distributes its chunks over its compute cores in a round-robin fashion, after which the compute cores trade different chunks from the same station for the same chunk from different stations, using an all-to-all exchange.
 
 \subsection{First All-to-all Exchange}
 
+The first all-to-all exchange in the pipeline allows the compute cores to distribute the chunks from a single station, and to collect all the chunks of the same frequency band produced by all of the stations. The exchange is performed over the fast 3D-torus network, but with up to 200~Gb/s of station data to be exchanged (64 stations, each producing 3.1~Gb/s), special care still has to be taken to avoid network bottlenecks. It is impossible to optimise for short network paths due to the physical distances between the different psets across a BG/P rack. Instead, we optimised the data exchange by creating as many paths as possible between compute cores that have to exchange data. Within each pset, we employ a virtual remapping such that communicating cores in different psets are collinear (see Figure \ref{fig:colinear}), while the I/O nodes can still address the compute cores in a round-robin fashion.
+
+The communications in the all-to-all exchange are asynchronous, which allows a compute core to start processing a subband from a station as soon as it arrives, up to the point that data from all stations are required. Communication and computation are thus overlapped as much as possible.
+
+\subsection{Pre-beamforming Signal Processing}
+
+Once a compute core receives a chunk, it can start processing. First, we convert the station data from 16-bit little-endian integers to 32-bit big-endian floating point numbers, in order to be able to do further processing using the two powerful FPUs present in each core. The data double in size, which is the main reason why we perform this conversion \emph{after} the exchange.
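The conversion step can be sketched as below; this is a hedged illustration using NumPy dtype codes, and the exact on-the-wire sample layout is an assumption, not the LOFAR wire format.

```python
import numpy as np

def convert_chunk(raw_bytes):
    """Widen 16-bit little-endian integer samples to 32-bit
    big-endian floats (the BG/P cores are big-endian)."""
    # '<i2' = little-endian 16-bit int, as received from the station
    i16 = np.frombuffer(raw_bytes, dtype='<i2')
    # '>f4' = big-endian 32-bit float, ready for the FPUs
    return i16.astype('>f4')

raw = np.array([1, -2, 3], dtype='<i2').tobytes()
converted = convert_chunk(raw)
# the data double in size: 2 bytes per sample in, 4 bytes out
```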
+
+Next, the data are filtered by applying a Poly-Phase Filter (PPF) bank, which consists of a Finite Impulse Response (FIR) filter and a Fast-Fourier Transform (FFT). The FFT allows the chunk, which represents a subband of 195~kHz, to be split into narrower subbands (\emph{channels}). A higher frequency resolution allows more precise corrections in the frequency domain, such as the removal of radio interference at very specific frequencies.
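A minimal PPF bank can be sketched as follows; the tap counts, shapes, and function names here are illustrative assumptions and not the filter actually deployed on the BG/P.

```python
import numpy as np

def ppf(samples, taps):
    """Minimal poly-phase filter bank: a FIR filter per channel
    followed by an FFT across the channels.

    samples: complex subband stream, length a multiple of n_chan
    taps:    real FIR coefficients, shape (n_taps, n_chan)
    returns: channelised data, shape (n_blocks, n_chan)
    """
    n_taps, n_chan = taps.shape
    x = samples.reshape(-1, n_chan)          # one row per FFT block
    n_blocks = x.shape[0] - n_taps + 1
    out = np.empty((n_blocks, n_chan), dtype=complex)
    for b in range(n_blocks):
        # FIR: weighted sum over n_taps consecutive blocks, per channel,
        # then an FFT splits the subband into n_chan channels
        out[b] = np.fft.fft((x[b:b + n_taps] * taps).sum(axis=0))
    return out
```

As a sanity check, a pure tone at the centre of channel $k$ should concentrate its power in FFT bin $k$.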
+
+Next, fine-grain delay compensation is performed to align the chunks from the different stations to the same source at which the stations are pointed. The fine-grain delay compensation is performed as a phase rotation, which is implemented as one complex multiplication per sample. The exact delays are computed for the start and end times of a chunk, and interpolated in frequency and time for each individual sample. %TODO: why a frequency-dependent component?
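The phase rotation can be sketched as below; the sign convention and the linear interpolation are our assumptions for illustration, not a transcription of the pipeline code.

```python
import numpy as np

def fine_delay(chunk, freqs, delay_begin, delay_end):
    """Apply residual sub-sample delays as per-sample phase rotations.

    chunk: complex channelised data, shape (n_chan, n_samples)
    freqs: channel centre frequencies in Hz, shape (n_chan,)
    delay_begin, delay_end: residual delay (s) at chunk start/end
    """
    n_chan, n_samples = chunk.shape
    # interpolate the delay linearly across the chunk in time
    delays = np.linspace(delay_begin, delay_end, n_samples)
    # a delay tau at frequency f corresponds to a phase rotation
    # exp(-2*pi*i*f*tau): one complex multiplication per sample
    phase = np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])
    return chunk * phase
```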
+
+Next, a band pass correction is applied to adjust the signal strengths in all channels. This is necessary, because the stations introduce a bias in the signal strengths across the channels within a subband.
+
+Up to this point, processing chunks from different stations can be done independently, but from here on, the data from all stations are required. The first all-to-all exchange thus ends here.
+
+\subsection{Beamforming}
+
+The beamformer creates the beams as described in Section \ref{Sec:Beamforming}. First, the different weights required for the different beams are computed, based on the station positions and the beam directions. Note that the data in the chunks are already delay compensated with respect to the source at which the stations are pointed. Any delay compensation performed by the beamformer is therefore to compensate the delay differences between the desired beams and the station's source. The reason for this two-stage approach is flexibility. By already compensating for the station's source in the previous step, the resulting data can not only be fed to the beamformer, but also to other pipelines, such as the imaging pipeline. Because we have a software pipeline, we can implement and connect different processing pipelines with only a small increase in complexity.
+
+The delays are applied to the station data through complex multiplications and additions. In order to take full advantage of the L1 cache and the available registers, the data are processed in sets of 6 stations, producing 3 beams, or a subset thereof to cover the remaining stations and beams. While the ideal set size depends on the architecture at hand, we have shown in previous work that similar trade-offs exist for similar problems across different architectures~\cite{FOO,BAR}.
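The blocked accumulation can be sketched as follows; the tiling loop is an illustrative reconstruction of the 6-station-by-3-beam scheme (with remainder tiles), not the hand-tuned BG/P kernel.

```python
import numpy as np

def beamform_blocked(samples, weights, st_block=6, bm_block=3):
    """Accumulate beams in small station-by-beam tiles so the
    partial sums fit in registers and the L1 cache.

    samples: (n_stations, n_samples) complex station data
    weights: (n_stations, n_beams) complex phase corrections
    """
    n_st, _ = samples.shape
    n_bm = weights.shape[1]
    beams = np.zeros((n_bm, samples.shape[1]), dtype=complex)
    for b0 in range(0, n_bm, bm_block):          # 3 beams per tile
        for s0 in range(0, n_st, st_block):      # 6 stations per tile
            w = weights[s0:s0 + st_block, b0:b0 + bm_block]
            beams[b0:b0 + bm_block] += w.T @ samples[s0:s0 + st_block]
    return beams
```

The result is identical to the unblocked weighted sum; only the traversal order changes.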
+
+Because each beam is an accumulated combination of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floating point numbers. Once the beams are formed, they can be kept as such (Complex Voltages) or optionally be transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated time-wise, in which groups of samples of fixed size are accumulated together to reduce the resulting data rate.
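The Stokes~I transformation with time-wise integration can be sketched as below, using the formula $I = X\overline{X} + Y\overline{Y}$ from Section \ref{Sec:Beamforming}; the function name and the fixed-size grouping are illustrative assumptions.

```python
import numpy as np

def stokes_i(x, y, integrate=1):
    """Stokes I (total power) with optional time integration.

    x, y: complex voltage streams of the X and Y polarisations;
    groups of 'integrate' consecutive samples are accumulated,
    reducing the output data rate by that factor.
    (len(x) must be a multiple of 'integrate'.)
    """
    i = (x * x.conj() + y * y.conj()).real   # I = X*conj(X) + Y*conj(Y)
    return i.reshape(-1, integrate).sum(axis=1)

# four unit samples integrated by 2: the output rate halves
out = stokes_i(np.ones(4, dtype=complex), np.zeros(4, dtype=complex),
               integrate=2)
```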
+
+The beamformer transforms chunks representing station data into chunks representing beam data. Because a chunk representing station data contained data for only one subband, the chunks representing different subbands of the same beam are still spread out over the full BG/P. Chunks corresponding to the same beam are brought together using a second all-to-all exchange.
+
+\subsection{Second All-to-all Exchange}
+
+In the second all-to-all exchange, the chunks made by the beamformer are rearranged again using the 3D-torus network. Due to memory constraints on the compute cores, the cores that performed the beamforming cannot be the same cores that receive the beam data after the exchange. Instead, we assign a set of cores (\emph{output cores}) to receive the rearranged chunks of beam data. The output cores are chosen before an observation, and never participate in the earlier computations in the pipeline (which are done by the \emph{input cores}).
+
+An output core first gathers chunks of beam data that belong to the same beam but represent different subbands. Then, it puts the data in the final ordering, which means both sorting the received chunks and reordering the dimensions of the data within a chunk. Reordering the data within a chunk is necessary, because the data order that will be written to disk is not the same order that can be produced by our computations without taking heavy L1 cache penalties. We hide this reordering cost at the output cores by overlapping computation (the reordering of a chunk) with communication (the arrival of other chunks). Once all of the chunks are received and reordered, they are sent back to the I/O node using the tree network.
+
+For the distribution of the workload over the available output cores, three factors have to be considered. First, all of the data belonging to the same beam has to be processed by output cores in the same pset, in order to ensure that its I/O node can concatenate all of the 0.25 second chunks that belong to the beam. Second, the maximal output rate of the I/O node has to be considered. As mentioned in Section \ref{Sec:Networks}, an I/O node can output 1.1~Gb/s if it is processing station input data, and 3.1~Gb/s if it is not. However, a complex-voltages beam is 6.2~Gb/s. We therefore optionally split each beam into two or more substreams, each of which will be output by a separate I/O node and each of which will eventually end up as a separate file on disk. For example, we store the X and Y polarisation data of a complex-voltages beam in two separate files. If there are not enough I/O nodes which do not process station input data, the X and Y polarisations have to be split further, in which case we store 82 or 83 subbands per file, with 3 files per polarisation.
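The file split follows from simple arithmetic: 248 subbands spread over 3 files per polarisation gives 82 or 83 subbands per file. A small sketch (our own helper, purely illustrative):

```python
def split_subbands(n_subbands, n_files):
    """Distribute subbands over output files as evenly as possible."""
    base, extra = divmod(n_subbands, n_files)
    # the first 'extra' files get one subband more than the rest
    return [base + 1 if f < extra else base for f in range(n_files)]

# 248 subbands over 3 files per polarisation
counts = split_subbands(248, 3)
```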
+
+A third factor in the scheduling of the output cores is the presence of the first all-to-all exchange, which uses the same 3D-torus network at 200~Gb/s. The amount of bandwidth used by the second all-to-all exchange is limited to the amount of data that can be produced, which in turn is limited by our storage cluster: 80~Gb/s. Careful planning of the workload distribution is necessary, because the 3D-torus network routes data over other compute nodes which are quite possibly also communicating at the same time. Some network links in the BG/P will be overloaded unless the output cores are either carefully distributed over the available psets, or unless dedicated psets are used for the output cores.
+
 \comment{
   Pulsar pipeline (include picture):
      - 1st transpose
@@ -204,6 +246,12 @@ To perform beamforming, the compute nodes need chunks from all stations. Unfortu
 
 \section{Results}
 
+\begin{figure}[ht]
+\includegraphics[width=\textwidth]{stations-beams.pdf}
+\caption{The maximum number of beams that can be created in various configurations.}
+\label{fig:stations-beams}
+\end{figure}
+
 \comment{
 Graphs:
   - #stations vs #beams in various modes