bug 1362: now conforms to required style (still 13 pages though and references are not self-contained)

Commit 88ff12ab authored Apr 15, 2011 by Jan David Mol
--- a/doc/papers/2011/europar/lofar.pdf
+++ b/doc/papers/2011/europar/lofar.pdf
--- a/doc/papers/2011/europar/lofar.tex
+++ b/doc/papers/2011/europar/lofar.tex
@@ -3,6 +3,7 @@
 \usepackage{listings,lstpseudo}
 \usepackage[usenames]{color}
 \usepackage{mathptmx}
+\usepackage{verbatim}

 \newfloat{listing}{th}{lst}
 \floatname{listing}{\bf Listing}
@@ -15,16 +16,15 @@
    \put(0,0){\makebox[9pt]{\fontfamily{phv}\fontseries{b}\selectfont\scriptsize#1}}%
    \end{picture}%
 }
-\renewcommand{\topfraction}{0.9}
-\renewcommand{\textfraction}{0.01}
-\renewcommand{\floatpagefraction}{0.99}
+%\renewcommand{\topfraction}{0.9}
+%\renewcommand{\textfraction}{0.01}
+%\renewcommand{\floatpagefraction}{0.99}

 \begin{document}
-\newcommand{\comment}[1]{}

-\author{Jan David Mol and John W. Romein}
+\author{Jan David Mol \and John W. Romein}
 \title{The LOFAR Beam Former: \\ Implementation and Performance Analysis}
-\institute{Stichting ASTRON (Netherlands Institute for Radio Astronomy) \\ Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands \\ \texttt{\{mol,romein\}@astron.nl}}
+\institute{Stichting ASTRON (Netherlands Institute for Radio Astronomy) \\ Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands \\ \email{\{mol,romein\}@astron.nl}}
 \maketitle

 \begin{abstract}
@@ -47,7 +47,7 @@ In this paper, we will present the LOFAR \emph{beam former}: a collection of sof

 It is LOFAR's unique design that allows us to point at many sources at once. Traditional telescopes use dishes that have a narrow field-of-view: they are only sensitive to a small region around the source they are pointed at. LOFAR's antennas are omnidirectional. Groups of antennas (\emph{stations}) are sensitive to a wide field-of-view around the source. These views, or \emph{station beams}, are sent to the BG/P, which generates weighted additions of the station input data, called \emph{tied-array beams}. Each tied-array beam represents an offset pointing within the wide field-of-view of the stations.

-The primary scientific use case driving the work presented in this paper is pulsar research. A pulsar is a rapidly rotating, highly magnetised neutron star, which emits electromagnetic radiation from its poles. Similar to the behaviour of a lighthouse, the radiation is visible to us only if one of the poles points towards the Earth, and subsequently appears to us as a very regular series of pulses, with a period as low as 1.4~ms. Pulsars are weak radio sources, and their individual pulses often do not rise above the background noise that fills our universe. Our beam former can track several pulsars at LOFAR's full observational bandwidth. Alternatively, the beam former is capable of efficiently performing sky surveys to discover new pulsars (or other radio sources) by covering the sky with hundreds of tied-array beams at a reduced observational bandwidth.
+The primary scientific use case driving the work presented in this paper is pulsar research~\cite{Stappers:11}. A pulsar is a rapidly rotating, highly magnetised neutron star, which emits electromagnetic radiation from its poles. Similar to the behaviour of a lighthouse, the radiation is visible to us only if one of the poles points towards the Earth, and subsequently appears to us as a very regular series of pulses, with a period as low as 1.4~ms. Pulsars are weak radio sources, and their individual pulses often do not rise above the background noise that fills our universe. Our beam former can track several pulsars at LOFAR's full observational bandwidth. Alternatively, the beam former is capable of efficiently performing sky surveys to discover new pulsars (or other radio sources) by covering the sky with hundreds of tied-array beams at a reduced observational bandwidth.

 The main contributions of this paper are threefold. First, we demonstrate the power of a \emph{software\/} telescope; its flexibility allows us to add new functionality with modest effort and we show how the use of supercomputer technology enables new science in astronomy and particle physics. Second, we describe the first system which allows a telescope to be pointed in hundreds of directions. Third, we elaborately analyse the performance of our application and the effectiveness of our optimisations. 

@@ -134,20 +134,15 @@ The beam former combines the chunks from all stations, producing a chunk for eac

 %The delays are applied to the station data through complex multiplications and additions, programmed in assembly. In order to take full advantage of the L1 cache and the available registers, data is processed in sets of 6 stations, producing 3 beams, in portions of 128 samples, or a subset thereof to cover the remainders. While the exact ideal set size in which the data is to be processed is platform specific, we have shown in previous work that similar tradeoffs exist for similar problems across different architectures~\cite{Nieuwpoort:09}.

-\begin{listing}
-\lstset{language=pseudo}
-\begin{lstlisting}{}
+All time-consuming pipeline components are written in assembly, to achieve maximum performance.  The assembly code minimises the number of memory accesses, minimises load delays, minimises FPU pipeline stalls, and maximises instruction-level parallelism.  We learnt that optimal performance is often achieved by combining multiple iterations of a multi-dimensional loop:
+\begin{verbatim}
 FOR Channel IN 1 .. NrChannels DO
  FOR Station IN 1 .. NrStations STEP 6 DO
    FOR Time IN 1 .. NrTimes STEP 128 DO
      FOR Beam IN 1 .. NrBeams STEP 3 DO
        BeamForm6StationsAnd128TimesTo3BeamsAssembly(...)
-\end{lstlisting}
-\caption{Pseudo code for the processing loops around the beam former assembly.}
-\label{lst:beam-forming}
-\end{listing}        
-
-All time-consuming pipeline components are written in assembly, to achieve maximum performance.  The assembly code minimises the number of memory accesses, minimises load delays, minimises FPU pipeline stalls, and maximises instruction-level parallelism.  We learnt that optimal performance is often achieved by combining multiple iterations of a multi-dimensional loops like shown in Listing~\ref{lst:beam-forming}. This is much more efficient than to create all beams one at a time, due to better reuse of data loaded from main memory.  Finding the most efficient way to group work is a combination of careful analysis and, unfortunately, trial-and-error. The coherent beam former achieves 86\% of the FPU peak performance, not as high as the 96\% of the correlator~\cite{Romein:10a}, but still 16 times more than the C++ reference implementation. 
+\end{verbatim}
+This is much more efficient than creating all beams one at a time, due to better reuse of data loaded from main memory.  Finding the most efficient way to group work is a combination of careful analysis and, unfortunately, trial-and-error. The coherent beam former achieves 86\% of the FPU peak performance, not as high as the 96\% of the correlator~\cite{Romein:10a}, but still 16 times that of the C++ reference implementation. 
 %Because each beam is an accumulation of the data from all stations, the bandwidth of each beam is equal to the bandwidth of data from a single station, which is 6.2~Gb/s now that the samples are 32-bit floats. Once the beams are formed, they are kept as complex voltages or transformed into the Stokes IQUV or the Stokes I parameters. In the latter case, the beams can also be integrated temporally to reduce the resulting data rate. Finally, an incoherent beam can be created in parallel, and converted into either Stokes I or Stokes IQUV parameters.

 %Our beam former supports several pipelines: \emph{complex voltages}, \emph{Stokes IQUV}, and \emph{Stokes I}. The complex voltages pipeline outputs the raw tied-array beams, which consist of two 3.1~Gb/s streams of 32-bit complex floating points numbers (floats), one stream for each polarisation. The Stokes IQUV pipeline applies a domain transformation to each sample of the raw tied-array beams, which is useful for polarisation-related studies. The four Stokes parameters, calculated through $I = X\overline{X} + Y\overline{Y}$, $Q = X\overline{X} - Y\overline{Y}$, $U = 2\mathrm{Re}(X\overline{Y})$, $V = 2\mathrm{Im}(X\overline{Y})$, with each parameter being a 32-bit float, resulting in four 1.5~Gb/s streams. The Stokes I pipeline calculates only the first Stokes parameter, which represents the signal strength in both polarisations. The Stokes I pipeline supports temporal integration to trade time resolution for a reduced bandwidth per beam, allowing more beams to be created.
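
The blocked loop nest in the hunk above can be illustrated with a small sketch. This is not the authors' assembly kernel: it is a plain Python stand-in for a single channel, and the function names, the `samples[station][time]` and `weights[beam][station]` layouts, and the parameter names are assumptions made for illustration only. It shows why grouping 6 stations, 128 time samples and 3 beams helps: each station sample is loaded once and reused for every beam in the block.

```python
def beam_form_naive(samples, weights):
    """Reference: form every beam one at a time.

    samples[station][time] and weights[beam][station] are complex numbers
    (hypothetical layout); returns beams[beam][time].
    """
    nr_stations, nr_times = len(samples), len(samples[0])
    return [[sum(weights[b][s] * samples[s][t] for s in range(nr_stations))
             for t in range(nr_times)]
            for b in range(len(weights))]


def beam_form_blocked(samples, weights, station_step=6, time_step=128, beam_step=3):
    """Same result, computed in station x time x beam blocks.

    The default block sizes mirror the loop steps in the diff; the innermost
    three loops stand in for the hand-written assembly kernel.
    """
    nr_stations, nr_times, nr_beams = len(samples), len(samples[0]), len(weights)
    beams = [[0j] * nr_times for _ in range(nr_beams)]
    for s0 in range(0, nr_stations, station_step):
        s1 = min(s0 + station_step, nr_stations)   # handle the remainder block
        for t0 in range(0, nr_times, time_step):
            t1 = min(t0 + time_step, nr_times)
            for b0 in range(0, nr_beams, beam_step):
                b1 = min(b0 + beam_step, nr_beams)
                for t in range(t0, t1):
                    for s in range(s0, s1):
                        x = samples[s][t]          # loaded once per block ...
                        for b in range(b0, b1):    # ... reused for each beam
                            beams[b][t] += weights[b][s] * x
    return beams


if __name__ == "__main__":
    # Small integer-valued test data so both versions can be compared exactly.
    samples = [[complex(s + 1, t % 5) for t in range(10)] for s in range(7)]
    weights = [[complex(b + 1, -s) for s in range(7)] for b in range(4)]
    print(beam_form_blocked(samples, weights) == beam_form_naive(samples, weights))
```

In Python the blocking brings no speed-up; the sketch only makes the data-reuse pattern and the remainder handling explicit. The outer channel loop from the diff is omitted because channels are processed independently.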
@@ -233,22 +228,25 @@ In the Stokes I pipeline, we applied several integration factors (1, 2, 4, 8, an
 \end{figure}

 \begin{table}[t]
+\caption{Several highlighted cases.}
+\label{table:cases}
 \center
-\begin{tabular}{l|l|r|r|r|r|r|r|l|l}
+\begin{tabular}{llrrrrrrll}
+\hline\noalign{\smallskip}
 Case & Mode & Channel & Int. & Stations & Beams  & Input & Output & Bound & Used for \\
     &      & dedisp. & factor      &          &        & rate  & rate   &       & \\
+\noalign{\smallskip}     
 \hline
-\hline
+\noalign{\smallskip}     
 \circlenumber{A} & Stokes I    & N & 16 &  4 & 450 &  12 Gb/s & 44 Gb/s & Torus & Surveys \\
 \circlenumber{B} & Stokes I    & N & 16 & 24 & 310 &  74 Gb/s & 30 Gb/s & CPU   & Surveys \\
 \circlenumber{C} & Stokes I    & N &  8 & 64 & 155 & 198 Gb/s & 30 Gb/s & CPU   & Surveys \\  
 \circlenumber{D} & Stokes IQUV & Y & - & 24 &  13 &  74 Gb/s & 81 Gb/s & I/O   & Known sources \\
 \circlenumber{E} & Stokes IQUV & Y & - & 64 &  10 & 198 Gb/s & 62 Gb/s & I/O   & Known sources \\
-\circlenumber{F} & Stokes I    & Y & 1 & 64 &  42 & 198 Gb/s & 65 Gb/s & I/O   & Known sources 
+\circlenumber{F} & Stokes I    & Y & 1 & 64 &  42 & 198 Gb/s & 65 Gb/s & I/O   & Known sources \\
+\hline
 \end{tabular}
-\caption{Several highlighted cases.}
-\label{table:cases}
-\vspace{-0.7cm}
+%\vspace{-0.7cm}
 \end{table}
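
As a sanity check on the table above, the data rates can be recomputed from per-beam figures. The sketch below rests on assumptions: the 6.2 Gb/s per full-resolution beam comes from the commented-out text in the earlier hunk, while the 3.1 Gb/s per station is only inferred from the table itself (for example 12 Gb/s for 4 stations). The function and constant names are hypothetical.

```python
BEAM_RATE_GBPS = 6.2     # one full-resolution tied-array beam (from the commented-out text)
STATION_RATE_GBPS = 3.1  # ASSUMPTION: inferred from the table's input-rate column


def input_rate(nr_stations):
    """Aggregate input rate in Gb/s for the given number of stations."""
    return nr_stations * STATION_RATE_GBPS


def output_rate(mode, nr_beams, integration=1):
    """Aggregate output rate in Gb/s.

    Stokes IQUV keeps four 32-bit parameters per sample; Stokes I keeps one
    of the four and may be integrated in time by `integration`.
    """
    if mode == "Stokes IQUV":
        per_beam = BEAM_RATE_GBPS
    elif mode == "Stokes I":
        per_beam = BEAM_RATE_GBPS / 4 / integration
    else:
        raise ValueError("unknown mode: " + mode)
    return nr_beams * per_beam


# (case, mode, integration factor, stations, beams), copied from the table
CASES = [
    ("A", "Stokes I", 16, 4, 450),
    ("B", "Stokes I", 16, 24, 310),
    ("C", "Stokes I", 8, 64, 155),
    ("D", "Stokes IQUV", 1, 24, 13),
    ("E", "Stokes IQUV", 1, 64, 10),
    ("F", "Stokes I", 1, 64, 42),
]

if __name__ == "__main__":
    for case, mode, integration, stations, beams in CASES:
        print(case, round(input_rate(stations)),
              round(output_rate(mode, beams, integration)))
```

Rounded to whole Gb/s, these values agree with the input and output rate columns of the table, which suggests the table follows this simple model; the model itself is not stated in the diff.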