Commit 0f555d1e, authored 15 years ago by John Romein
bug 225:
parent daac3330
doc/papers/2010/SPM/spm.tex: 34 additions, 47 deletions
@@ -266,6 +266,17 @@ the arithmetic intensity is extremely low.
+Prior to correlation, an FX correlator must reorder the data that comes from
+the receivers:
+each input carries the signals of many frequency subbands from a single
+receiver, but the correlator correlates
+Depending on the data rate, switching the data can be a real challenge.
+The data reordering phase is outside the scope of this paper, but a correlator
+implementation cannot ignore this issue.
+The LOFAR Blue Gene/P correlator uses the fast 3-D~torus for this purpose;
+other multi-core architectures need external switches.
 \section{Many-core architectures}
 In this section, we briefly explain key properties of six different
@@ -309,67 +320,43 @@ imaginary values.
 \subsection{General Purpose multi-core CPU (Intel Core i7 920)}
 As a reference, we implemented the correlator on a multi-core general-purpose
 architecture.
-The theoretical peak performance of the system is 85~gflops, in single
-precision.
+We use a quad core Intel Core~i7 920 CPU (code name Nehalem) at 2.67~GHz.
+There is 32~KB of on-chip L1 data cache per core, 256~KB L2 cache per core, and 8~MB
+of shared L3 cache.
+The thermal design power (TDP) is 130~Watts.
+The theoretical
+peak performance of the system is 85~gflops, in single precision.
 The parallelism comes from four cores with two-way hyperthreading, and a vector length of four floats,
 provided by the SSE4 instruction set.
-\subsection{General Purpose multi-core CPU}
-As a reference, we implemented the correlator on a multi-core general
-purpose architecture, a quad core Intel Core~i7 CPU. There is 32~KB
-of on-chip L1 data cache per core, 256~KB L2 cache per core, and 8~MB
-of shared L3 cache. The theoretical peak performance of the system is
-85~gflops, in single precision. The parallelism comes from four cores
-with two-way hyperthreading, and a vector length of four floats,
-provided by the SSE4 instruction set.
-The architecture has several important drawbacks for our application.
-First, there is no fused multiply-add instruction. Since the
-correlator performs mostly multiplies and adds, this can cause a
-performance penalty. The processor does have multiple pipelines, and
-the multiply and add instructions are executed in different pipelines,
+SSE4 does not provide fused multiply-add instructions, but the Core~i7 issues
+vector-multiply and vector-add instructions concurrently in different pipelines,
 allowing eight flops per cycle per core.
-Another problem is that SSE's shuffle instructions to move data around
-in vector registers are more limited than for instance on the
-\mbox{Cell/B.E.} processor. This complicates an efficient
-implementation. For the future Intel Larrabee GPU, and for the next
-generation of Intel processors, both a fused multiply-add instruction
-and improved shuffle support has been announced. The number of SSE
-registers is small (sixteen 128-bit registers), allowing only little
-data reuse. This is a problem for the correlator, since the tile size
-is limited by the number of registers. A smaller tile size means less
-opportunity for data reuse, increasing the memory bandwidth that is
-required.
+One problem of SSE4 that complicates an efficient correlator is the limited
+support for shuffling data within vector registers, unlike the Cell~BE, for
+instance, that can shuffle any byte to any position.
+Also, the number of vector registers is small (sixteen four-word registers).
+Therefore, there is not much opportunity to reuse data in registers; reuse
+has to come from the L1~data cache.
+Consequently, the correlator uses a small tile size.
 \subsection{IBM Blue Gene/P}
 The IBM Blue Gene/P~(BG/P)~\cite{IBM:08} is the architecture that is
 currently used for the LOFAR correlator~\cite{Romein:06,Romein:09b}.
-Four PowerPC
-processors are integrated on each Blue Gene/P chip. The BG/P is an
-energy efficient supercomputer. This is accomplished by using many
-small, low-power chips, at a low clock frequency. The supercomputer
-also has excellent I/O capabilities, there are five specialized
-networks for communication.
+Four PowerPC processors are integrated on each Blue Gene/P chip.
+The BG/P is an energy efficient supercomputer.
+This is accomplished by using many small, low-power chips, at a low clock
+frequency.
+The supercomputer also has excellent I/O capabilities, there are five
+specialized networks for communication.
 We found that the BG/P is extremely suitable for our application,
-since it is highly optimized for processing of complex numbers. The
-BG/P performs \emph{all} floating point operations in double
-precision, which is overkill for our application. In contrast to all
-other architectures we evaluate, the problem is compute bound instead
-of I/O bound, thanks to the BG/P's high memory bandwidth per
-operation. It is 3--10 times higher than for the other architectures.
+since it is highly optimized for processing of complex numbers.
+The BG/P performs \emph{all} floating point operations in double
+precision, which is overkill for our application.
+In contrast to all other architectures we evaluate, the problem is compute
+bound instead of I/O bound, thanks to the BG/P's high memory bandwidth per
+operation, which is 3--10 times higher than for the other architectures.
 The BG/P has 32 vector registers of width 2. Therefore, 64 floating
 point numbers (with double precision) can be kept in registers
 simultaneously. This is the same amount as on the general purpose
...
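As a quick consistency check on the Core~i7 figures quoted in the diff above, the peak can be sketched as cores times clock rate times flops per cycle per core. This assumes the 2.67~GHz nominal clock with no turbo mode, and that the eight flops per cycle per core come from one four-wide vector multiply plus one four-wide vector add issued together; the decomposition is an illustration, not a statement from the paper itself.

\[
4~\mathrm{cores} \times 2.67~\mathrm{GHz} \times 8~\mathrm{flops/cycle/core}
  = 85.44~\mathrm{gflops} \approx 85~\mathrm{gflops}
\]

Under these assumptions the product agrees with the 85~gflops single-precision peak stated in the text.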