From bb4204ba634a3dc6e7a6682e037163b92d5d1ad5 Mon Sep 17 00:00:00 2001
From: Rob van Nieuwpoort <nieuwpoort@astron.nl>
Date: Tue, 23 Jun 2009 15:06:53 +0000
Subject: [PATCH] Bug 1198: s4 done

---
 doc/papers/2010/SPM/spm.tex | 88 +++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 47 deletions(-)

diff --git a/doc/papers/2010/SPM/spm.tex b/doc/papers/2010/SPM/spm.tex
index 90e1bad5204..9975cdac110 100644
--- a/doc/papers/2010/SPM/spm.tex
+++ b/doc/papers/2010/SPM/spm.tex
@@ -149,6 +149,17 @@ to other instruments.
 
 \section{Trends in radio astronomy}
 
+%% @@@
+%% It is important that the authors take a tutorial 
+%% oriented style and more carefully introduce the 
+%% application context, including radio-astronomy 
+%% basics, instruments that they use (including 
+%% installation roadmap).  The algorithm is quite 
+%% simple and so the strength of the paper lies in 
+%% the thoroughness of the analysis, and the 
+%% aforementioned tutorial background.
+%% @@@
+
 %- signal processing neemt een dominantere rol (meer antennes, etc)
 %- voorbeelden. pathfinders voor SKA. 
 %- computationally intensive, SKA even more
@@ -160,10 +171,6 @@ rely less on concrete, steel, and extreme cooling techniques, but more on
 signal-processing techniques.
 For example, LOFAR~\cite{Butcher:04,deVos:09} is a distributed sensor network
 that combines the signals of tens of thousands of simple receiver elements.
-%Unlike traditional telescopes, that typically use custom-built hardware to
-%process data, LOFAR uses programmable FPGAs for on-the-field station
-%processing and a Blue Gene/P supercomputer to process data centrally, in real
-%time.
 Also, Aperture Array tiles like Embrace~\cite{?} and Focal Plane Arrays
 like Apertif~\cite{?} are novel multi-receiver concepts that require huge
 amounts of processing power to combine the data from the receiving elements.
@@ -173,37 +180,20 @@ and multiple, concurrent observation directions.
 
 %@@@ later even kijken: computing advances maken nieuwe signal processing technieken en telescopen / instrumenten mogelijk.
 
-
-
-%% @@@
-%% It is important that the authors take a tutorial 
-%% oriented style and more carefully introduce the 
-%% application context, including radio-astronomy 
-%% basics, instruments that they use (including 
-%% installation roadmap).  The algorithm is quite 
-%% simple and so the strength of the paper lies in 
-%% the thoroughness of the analysis, and the 
-%% aforementioned tutorial background.
-%% @@@
-
 %% SKA + pathfinders: EMBRACE, LOFAR, ASKAP, meerKAT
 
-
-
-
-
-The signal-processing hardware technology used to process telescope data
-also changes rapidly.
-Only a decade ago, correlators required special-purpose ASICs to keep up with
-the high data rates and processing requirements.
-The advent of sufficiently fast FPGAs significantly lowered the developments
-times and costs of newer-generation correlators, and increased the flexibility
-substantially.
-LOFAR requires even more flexibility to support many different processing
-pipelines for various observation modes, and uses a Blue Gene/P supercomputer
-to perform real-time, central processing.
-
-GPUs seem to be a viable complement to the aforementioned processing platforms.
+The signal-processing hardware technology used to process telescope
+data also changes rapidly.  Only a decade ago, correlators required
+special-purpose ASICs to keep up with the high data rates and
+processing requirements.  The advent of sufficiently fast FPGAs
+significantly lowered the development times and costs of
+newer-generation correlators, and increased the flexibility
+substantially.  LOFAR requires even more flexibility to support many
+different processing pipelines for various observation modes, and uses
+FPGAs for on-the-field station processing and a Blue Gene/P
+supercomputer to perform real-time, central processing.
+
+Recent many-core architectures seem to be a viable complement to the aforementioned processing platforms.
 GPUs provide more processing power and are more power-efficient than CPUs,
 while GPUs are more flexible and easier to program than FPGAs.
 Since GPUs of different vendors are mutually quite different, we did an
@@ -291,16 +281,19 @@ since we need this later in the pipeline for calibration purposes.
 The autocorrelations can be computed with half the number of instructions.
 
 We can implement the correlation operation very efficiently, with only
-four fused-multily-add (fma) instructions, doing eight floating-point operations in
-total. For each pair of receivers, we have to do this four times, once
-for each combination of polarizations. Thus, in total we need 32
-operations. To perform these operations, we have to load the samples generated by two different receivers from memory.
-As explained above, the samples each consist of four single precision floating point numbers (a real and imaginary part, and two polarizations).
-Therefore, we need to load 8 floats or 32 bytes in total.
-This results in \emph{exactly one FLOP/byte}.  The number of operations that is performed per byte
-that has to be loaded from main memory is called the \emph{arithmetic intensity}~\cite{system-performance}. 
-For the correlation algorithm,
-the arithmetic intensity is extremely low.
+four fused multiply-add (fma) instructions, doing eight floating-point
+operations in total. For each pair of receivers, we have to do this
+four times, once for each combination of polarizations. Thus, in total
+we need 32 operations. To perform these operations, we have to load
+the samples generated by two different receivers from memory.  As
+explained above, each sample consists of four single-precision
+floating-point numbers: a real and an imaginary part for each of the
+two polarizations.  Therefore, we need to load 8 floats or 32 bytes in
+total.  This results in \emph{exactly one FLOP/byte}.  The number of
+operations performed per byte that has to be loaded from main memory
+is called the \emph{arithmetic intensity}~\cite{system-performance}.
+For the correlation algorithm, the arithmetic intensity is extremely
+low.
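+
+As an illustration, the work for one sample of a single receiver pair
+can be sketched as follows (plain, unoptimized C; the names are only
+illustrative, and the conjugation follows the usual definition of the
+correlation):
+\begin{small}
+\begin{verbatim}
+#include <complex.h>
+
+/* One sample of one receiver pair; each sample holds one complex
+   float per polarization (4 floats = 16 bytes per receiver). */
+void correlate_sample(const float complex x[2],  /* receiver A: pol X, Y */
+                      const float complex y[2],  /* receiver B: pol X, Y */
+                      float complex vis[2][2])   /* accumulated visibilities */
+{
+  for (int p = 0; p < 2; p++)
+    for (int q = 0; q < 2; q++)
+      /* one complex multiply-accumulate: 4 fmas, 8 real operations */
+      vis[p][q] += x[p] * conjf(y[q]);
+}
+\end{verbatim}
+\end{small}
+The four loop iterations perform $4 \times 8 = 32$ operations on the
+$2 \times 4 \times 4 = 32$ bytes of loaded samples, which yields the
+arithmetic intensity of one FLOP/byte derived above.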
 
 
 
@@ -380,10 +373,11 @@ a summary of the most important similarities and differences for signal processi
 \subsection{General Purpose multi-core CPU (Intel Core i7 920)}
 
 As a reference, we implemented the correlator on a multi-core
-general-purpose architecture, in this case an Intel core~i7.  The theoretical peak performance of the
-system is 85~gflops, in single precision.  The parallelism comes from
-four cores with two-way hyperthreading, and a vector length of four
-floats, provided by the SSE4 instruction set.
+general-purpose architecture, in this case an Intel Core~i7.  The
+theoretical peak performance of the system is 85~gflops, in single
+precision.  The parallelism comes from four cores with two-way
+hyperthreading, and a vector length of four floats, provided by the
+SSE4 instruction set.
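+Assuming the 2.66~GHz clock of the Core~i7~920 and one vector multiply
+plus one vector add issued per cycle (see below), this peak follows
+from $4~\mathrm{cores} \times 4~\mathrm{floats} \times
+2~\mathrm{ops} \times 2.66~\mathrm{GHz} \approx 85$~gflops;
+hyperthreading does not add floating-point throughput.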
 
 SSE4 does not provide fused multiply-add instructions, but the Core~i7
 issues vector-multiply and vector-add instructions concurrently in
-- 
GitLab