Commit bb4204ba authored by Rob van Nieuwpoort

Bug 1198: s4 done

parent e3f05d67
@@ -149,6 +149,17 @@ to other instruments.
\section{Trends in radio astronomy}
%% @@@
%% It is important that the authors take a tutorial
%% oriented style and more carefully introduce the
%% application context, including radio-astronomy
%% basics, instruments that they use (including
%% installation roadmap). The algorithm is quite
%% simple and so the strength of the paper lies in
%% the thoroughness of the analysis, and the
%% aforementioned tutorial background.
%% @@@
%- signal processing takes on a more dominant role (more antennas, etc.)
%- examples: pathfinders for SKA.
%- computationally intensive, SKA even more
@@ -160,10 +171,6 @@ rely less on concrete, steel, and extreme cooling techniques, but more on
signal-processing techniques.
For example, LOFAR~\cite{Butcher:04,deVos:09} is a distributed sensor network
that combines the signals of tens of thousands of simple receiver elements.
%Unlike traditional telescopes, that typically use custom-built hardware to
%process data, LOFAR uses programmable FPGAs for on-the-field station
%processing and a Blue Gene/P supercomputer to process data centrally, in real
%time.
Also, Aperture Array tiles like Embrace~\cite{?} and Focal Plane Arrays
like Apertif~\cite{?} are novel multi-receiver concepts that require huge
amounts of processing power to combine the data from the receiving elements.
@@ -173,37 +180,20 @@ and multiple, concurrent observation directions.
%@@@ look into this later: computing advances make new signal-processing techniques and telescopes / instruments possible.
%% SKA + pathfinders: EMBRACE, LOFAR, ASKAP, meerKAT
The signal-processing hardware technology used to process telescope
data also changes rapidly. Only a decade ago, correlators required
special-purpose ASICs to keep up with the high data rates and
processing requirements. The advent of sufficiently fast FPGAs
significantly lowered the development times and costs of
newer-generation correlators, and increased the flexibility
substantially. LOFAR requires even more flexibility to support many
different processing pipelines for various observation modes, and uses
FPGAs for on-the-field station processing and a Blue Gene/P
supercomputer to perform real-time, central processing.

Recent many-core architectures seem to be a viable complement to the aforementioned processing platforms.
GPUs provide more processing power and are more power-efficient than CPUs,
while GPUs are more flexible and easier to program than FPGAs.
Since GPUs of different vendors are mutually quite different, we did an
@@ -291,16 +281,19 @@ since we need this later in the pipeline for calibration purposes.
The autocorrelations can be computed with half the number of instructions.
We can implement the correlation operation very efficiently, with only
four fused-multiply-add (fma) instructions, doing eight floating-point
operations in total. For each pair of receivers, we have to do this
four times, once for each combination of polarizations. Thus, in total
we need 32 operations. To perform these operations, we have to load
the samples generated by two different receivers from memory. As
explained above, the samples each consist of four single-precision
floating-point numbers (a real and an imaginary part for each of two
polarizations). Therefore, we need to load 8 floats or 32 bytes in
total. This results in \emph{exactly one FLOP/byte}. The number of
operations that are performed per byte that has to be loaded from main
memory is called the \emph{arithmetic
intensity}~\cite{system-performance}. For the correlation
algorithm, the arithmetic intensity is extremely low.
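
As a sketch of where these numbers come from, the counts per receiver pair
can be written out explicitly. The symbols $N_{\mathrm{flop}}$,
$N_{\mathrm{byte}}$, and $I$ are introduced here for illustration only. We
assume that one fma counts as two floating-point operations and that a
single-precision float occupies four bytes:
\[
\begin{array}{rcl}
N_{\mathrm{flop}} & = & 4~\mbox{polarization pairs} \times 4~\mbox{fma} \times 2~\mbox{flop/fma} = 32, \\
N_{\mathrm{byte}} & = & 2~\mbox{receivers} \times 4~\mbox{floats} \times 4~\mbox{bytes/float} = 32, \\
I & = & N_{\mathrm{flop}} / N_{\mathrm{byte}} = 1~\mbox{FLOP/byte}.
\end{array}
\]
The four fmas per polarization pair correspond to one complex
multiply-accumulate. Assuming the common correlator convention of
multiplying a sample $a$ by the complex conjugate of a sample $b$ (the
sign convention is an assumption here) and accumulating into $c$, the
update is
\[
\begin{array}{rcl}
c_{\mathrm{re}} & \leftarrow & c_{\mathrm{re}} + a_{\mathrm{re}} b_{\mathrm{re}} + a_{\mathrm{im}} b_{\mathrm{im}}, \\
c_{\mathrm{im}} & \leftarrow & c_{\mathrm{im}} + a_{\mathrm{im}} b_{\mathrm{re}} - a_{\mathrm{re}} b_{\mathrm{im}},
\end{array}
\]
that is, two multiply-add steps for the real part and two for the
imaginary part.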
@@ -380,10 +373,11 @@ a summary of the most important similarities and differences for signal processi
\subsection{General Purpose multi-core CPU (Intel Core i7 920)}
As a reference, we implemented the correlator on a multi-core
general-purpose architecture, in this case an Intel Core~i7. The
theoretical peak performance of the system is 85~gflops, in single
precision. The parallelism comes from four cores with two-way
hyperthreading, and a vector length of four floats, provided by the
SSE4 instruction set.
SSE4 does not provide fused multiply-add instructions, but the Core~i7
issues vector-multiply and vector-add instructions concurrently in
......