Commit 4b07ac28 authored by Rob van Nieuwpoort

Bug 1198: intro

parent 755b9762
...@@ -49,7 +49,7 @@ ...@@ -49,7 +49,7 @@
\usepackage{graphicx} \usepackage{graphicx}
\usepackage{listings} \usepackage{listings}
\title{How to Build a Correlator on Many-Core Hardware} \title{How to Build a Correlator with Many-Core Hardware}
\name{Rob V. van Nieuwpoort and John W. Romein} \name{Rob V. van Nieuwpoort and John W. Romein}
...@@ -81,6 +81,7 @@ The stations of the \emph{Low Frequency Array ...@@ -81,6 +81,7 @@ The stations of the \emph{Low Frequency Array
(LOFAR)\/}~\cite{Butcher:04,deVos:09}, for instance, will produce some tens of (LOFAR)\/}~\cite{Butcher:04,deVos:09}, for instance, will produce some tens of
petabits per day; the dishes from the Australian SKA Pathfinder (ASKAP) will petabits per day; the dishes from the Australian SKA Pathfinder (ASKAP) will
even produce over six exabits per day. even produce over six exabits per day.
These modern radio telescopes use many separate receivers as building blocks.
To extract the sky signal from the system noise, the \emph{correlator\/} To extract the sky signal from the system noise, the \emph{correlator\/}
correlates the signals by multiplying the samples of each pair of receivers. correlates the signals by multiplying the samples of each pair of receivers.
Additionally, the correlator integrates correlations over time, to reduce Additionally, the correlator integrates correlations over time, to reduce
...@@ -96,7 +97,16 @@ efficient, consume more power, and are expensive to purchase and maintain. ...@@ -96,7 +97,16 @@ efficient, consume more power, and are expensive to purchase and maintain.
Future instruments, like the Square Kilometre Array (SKA), need several orders Future instruments, like the Square Kilometre Array (SKA), need several orders
of magnitude more computational resources. of magnitude more computational resources.
It is likely that the requirements of the SKA cannot be met by using It is likely that the requirements of the SKA cannot be met by using
current supercomputer technology. current supercomputer technology. Therefore, it is important to investigate
alternative hardware solutions, in particular many-core architectures.
A recent development is that general-purpose architectures no longer
achieve performance increases by increasing the clock frequency, but
by adding more compute cores and exploiting parallelism. Intel's
recent Core~i7 processor is a good example of this. It has four
compute cores, but eight concurrent threads can be used thanks to the
hyperthreading technique. In addition, the cores can exploit SIMD
parallelism with the SSE4 instruction set.
During the past ten years, the high-performance computing community has During the past ten years, the high-performance computing community has
steadily adopted clusters of Graphics Processor Units (GPUs) as a viable steadily adopted clusters of Graphics Processor Units (GPUs) as a viable
...@@ -114,37 +124,36 @@ High-end GPUs are highly parallel and contain hundreds of processor cores. ...@@ -114,37 +124,36 @@ High-end GPUs are highly parallel and contain hundreds of processor cores.
The IBM Cell Broadband Engine~\cite{Gschwind:06}, well known from the The IBM Cell Broadband Engine~\cite{Gschwind:06}, well known from the
PlayStation~3, is another example of a processor that combines GPU and CPU PlayStation~3, is another example of a processor that combines GPU and CPU
qualities into one design. qualities into one design.
The Cell BE consists of an ``ordinary'' PowerPC core and eight powerful The Cell/B.E. consists of an ``ordinary'' PowerPC core and eight powerful
\emph{Synergistic Processing Elements (SPEs)}, co-processors that provide \emph{Synergistic Processing Elements (SPEs)}, co-processors that provide
the bulk of the processing power. the bulk of the processing power.
The SPEs are vector processors with fast, local memories, and are capable The SPEs are vector processors with fast, local memories, and are capable
of transferring data from and to main memory by means of DMA. of transferring data from and to main memory by means of DMA.
Programming the SPEs requires more effort than programming an ordinary CPU, Programming the SPEs requires more effort than programming an ordinary CPU,
but various studies showed that the Cell BE performs very well on but various studies showed that the Cell/B.E. performs very well on
signal-processing tasks like FFTs~\cite{?}. signal-processing tasks like FFTs~\cite{fftc}.
In this article, we explain how modern multi-core architectures can be In this article, we explain how modern multi-core architectures can be
exploited for signal-processing purposes. exploited for signal-processing purposes. Additionally, we give
Additionally, we give insights into their architectural limitations, and how insights into their architectural limitations, and how to best cope
to best cope with them. with them. We treat five different, popular architectures with
We treat five different, popular multi-core architectures: the IBM Cell BE, multiple cores: the IBM Cell/B.E., GPUs from Nvidia and ATI, the IBM
GPUs from Nvidia and ATI, the IBM Blue Gene/P, and Blue Gene/P supercomputer, and the Intel Core i7 processors. We discuss their
the Intel Core i7 processors. similarities and differences, and how the architectural differences
We discuss their similarities and differences, and how the architectural affect optimization choices and the eventual performance of a
differences affect optimization choices and the eventual performance of a correlator. We strongly focus on correlators, but many of the
correlator. findings, claims, and optimizations hold for other signal-processing
We strongly focus on correlators, but many of the findings, claims, and algorithms as well, both in and outside the area of radio astronomy.
optimizations hold for other signal-processing algorithms as well, both in and outside the We discuss the programmability of each of the architectures, but this
area of radio astronomy. paper should be of special interest to those who are willing to put
We discuss the programmability of each of the architectures, but this paper some extra programming effort to obtain good performance, even if
should be of special interest to those who are willing to put some extra high-level programming support is not available.
programming effort to obtain good performance, even if high-level programming
support is not available. In this paper, we use the LOFAR telescope as a running example, and
compare with the production correlator on the Blue Gene/P. This way,
we can demonstrate how many-core processors can be used in practice for a real
In this paper, we use the LOFAR application. Nevertheless, the results apply equally well to other
telescope as an example, but the results apply equally well instruments.
to other instruments.
\section{Trends in radio astronomy} \section{Trends in radio astronomy}
...@@ -205,6 +214,36 @@ purposes~\cite{Nieuwpoort:09}. ...@@ -205,6 +214,36 @@ purposes~\cite{Nieuwpoort:09}.
@@@
LOFAR started as a new and innovative effort to force a breakthrough
in sensitivity for astronomical observations at radio frequencies
below 250 MHz. The basic technology of radio telescopes had not
changed since the 1960s: large mechanical dish antennas collect
signals before a receiver detects and analyses them. Half the cost of
these telescopes lies in the steel and moving structure. A telescope
100 times larger than existing instruments would therefore be
unaffordable. New technology was required to make the next step in
sensitivity needed to unravel the secrets of the early universe and
the physical processes in the centers of active galactic nuclei.
LOFAR is the first telescope of this new sort, using an array of
simple omni-directional antennas instead of mechanical signal
processing with a dish antenna. The electronic signals from the
antennas are digitised, transported to a central digital processor,
and combined in software to emulate a conventional antenna. The cost
is dominated by the cost of electronics and will follow Moore's law,
becoming cheaper with time and allowing increasingly large telescopes
to be built. So LOFAR is an IT-telescope. The antennas are simple
enough, but there are a lot of them: 25000 in the full LOFAR
design. To make radio pictures of the sky with adequate sharpness,
these antennas are to be arranged in clusters that are spread out over
an area of ultimately 350 km in diameter. (In phase 1, which is
currently funded, 15000 antennas and maximum baselines of 100 km will
be built.) Data transport requirements are in the range of many
terabits per second, and the processing power needed is tens of teraflops.
@@@
\section{Correlating signals} \section{Correlating signals}
...@@ -249,8 +288,8 @@ accurate enough for our purposes. From the perspective of the ...@@ -249,8 +288,8 @@ accurate enough for our purposes. From the perspective of the
correlator, samples thus consist of four 32-bit floating point correlator, samples thus consist of four 32-bit floating point
numbers: two polarizations, each with a real and an imaginary part. numbers: two polarizations, each with a real and an imaginary part.
Prior to correlation, an FX correlator must reorder the data that comes from Prior to correlation, the data that comes from
the receivers: the receivers must be reordered:
each input carries the signals of many frequency subbands from a single each input carries the signals of many frequency subbands from a single
receiver, but the correlator needs data from a single frequency of all inputs. receiver, but the correlator needs data from a single frequency of all inputs.
Depending on the data rate, switching the data can be a real challenge. Depending on the data rate, switching the data can be a real challenge.
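The reordering step and the pairwise correlation it feeds can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used on any of the architectures discussed here: the function names \texttt{reorder\_by\_subband} and \texttt{correlate\_subband} and the nested-list data layout are hypothetical, only a single polarization is handled, autocorrelations are included, and the second operand is conjugated, which is a common convention that this text does not prescribe.

```python
from itertools import combinations_with_replacement


def reorder_by_subband(per_receiver):
    """Transpose receiver-major input into subband-major order.

    per_receiver[r][s] holds the time series of receiver r in subband s
    (hypothetical layout). The result is indexed as per_subband[s][r].
    """
    num_subbands = len(per_receiver[0])
    return [[receiver[s] for receiver in per_receiver]
            for s in range(num_subbands)]


def correlate_subband(receivers):
    """Correlate every receiver pair of one subband, integrating over time.

    receivers[r] is a list of complex samples of one polarization.
    Returns {(r1, r2): sum over t of x1[t] * conj(x2[t])} for r1 <= r2.
    """
    visibilities = {}
    pairs = combinations_with_replacement(range(len(receivers)), 2)
    for r1, r2 in pairs:
        visibilities[(r1, r2)] = sum(
            a * b.conjugate()
            for a, b in zip(receivers[r1], receivers[r2]))
    return visibilities


if __name__ == "__main__":
    # Two receivers, two subbands, two time samples each (made-up values).
    per_receiver = [
        [[1 + 1j, 2 + 0j], [0 + 1j, 1 + 0j]],  # receiver 0
        [[1 - 1j, 0 + 2j], [1 + 0j, 0 - 1j]],  # receiver 1
    ]
    per_subband = reorder_by_subband(per_receiver)
    for s, receivers in enumerate(per_subband):
        print(s, correlate_subband(receivers))
```

The transpose is a pure data movement, and the arithmetic only starts once all receivers of one subband are available together. That separation is why the text treats the reordering as a distinct step.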
...@@ -384,7 +423,7 @@ issues vector-multiply and vector-add instructions concurrently in ...@@ -384,7 +423,7 @@ issues vector-multiply and vector-add instructions concurrently in
different pipelines, allowing eight flops per cycle per core. One different pipelines, allowing eight flops per cycle per core. One
problem of SSE4 that complicates an efficient correlator is the problem of SSE4 that complicates an efficient correlator is the
limited support for shuffling data within vector registers, unlike the limited support for shuffling data within vector registers, unlike the
Cell~BE, for instance, that can shuffle any byte to any position. Cell/B.E., for instance, that can shuffle any byte to any position.
Also, the number of vector registers is small (sixteen four-word Also, the number of vector registers is small (sixteen four-word
registers). Therefore, there is not much opportunity to reuse data in registers). Therefore, there is not much opportunity to reuse data in
registers; reuse has to come from the L1~data cache. registers; reuse has to come from the L1~data cache.