Commit b4206617
authored 15 years ago by Rob van Nieuwpoort
Bug 1198: s5-end
parent b18bc51b
Changes: 1 changed file
doc/papers/2010/SPM/spm.tex: 52 additions, 91 deletions
@@ -721,8 +721,8 @@ autocorrelations.
 For example, the samples from receivers 8, 9, 10, and 11 can be correlated
 with the samples from receivers 4, 5, 6, and 7 (the red square in the figure),
 reusing each fetched sample four times.
-This way, eight samples are read from memory for 16
-multiplications, reducing the amount of memory operations already by a factor
+This way, eight samples are read from memory for sixteen
+multiplications, reducing the amount of memory operations by a factor
 of four.
 Correlating even higher numbers of receivers simultaneously would reduce the
 memory bandwidth usage further, but the maximum number of receivers that can
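To make the reuse arithmetic above concrete, here is a minimal C++ sketch of correlating such a 4x4 tile; it is not taken from the paper's correlator (names are illustrative, and samples are reduced to plain floats rather than dual-polarization complex values):

// Correlate a 4x4 tile: receivers 8..11 against receivers 4..7.
// Per time step, 8 samples are read (4 + 4) and combined into 16 products,
// so each fetched sample is reused four times, as described in the text.
void correlateTile(const float *samples[], int nTimes, float acc[4][4])
{
    for (int t = 0; t < nTimes; t++) {
        float a[4], b[4];
        for (int i = 0; i < 4; i++) a[i] = samples[8 + i][t]; // 4 loads
        for (int j = 0; j < 4; j++) b[j] = samples[4 + j][t]; // 4 loads
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                acc[i][j] += a[i] * b[j]; // 16 multiply-adds from 8 loads
    }
}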
@@ -739,24 +739,24 @@ architecture.
 %On the one hand, correlating and integrating over long periods of time
 %is good for pipelined FPU operation, on the other hand, the
-Even when dividing the correlation triangle in tiles there still is
+There still is
 opportunity for additional data reuse \emph{between} tiles. The tiles
 within a row or column in the triangle still need the same samples.
 In addition to registers, caches can thus also be used to increase
 data reuse.
 It is important to realize that the
-correlator itself is \emph{trivially parallel}, since tens of thousands of
-frequency channels can be processed independently. This allows us to
+correlator itself is \emph{trivially parallel}, since the tens of thousands of
+frequency channels that LOFAR uses can be processed independently. This allows us to
 efficiently exploit many-core hardware.
 We will now describe the implementation of the correlator on
 the different architectures.
 We evaluate the performance in detail. For comparison reasons, we use the performance
 \emph{per chip} for each architecture.
-We choose 64 as the number of receivers, since
+We choose 64 as the number of receivers (each consisting of hundreds of antennas), since
 that is a realistic number for LOFAR. Future instruments will likely
 have even more receivers.
+The performance results are shown in Figure~\ref{performance-graph}.
 \begin{figure*}[t]
 \begin{center}
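The trivial parallelism over frequency channels that the new text emphasizes maps directly onto threads; a sketch (OpenMP and the per-channel helper are assumptions, not the production code):

#include <omp.h>

void correlateChannel(int channel);  // hypothetical per-channel tile loop

void correlateAllChannels(int nChannels)
{
    // Every frequency channel is independent, so the outer loop is
    // embarrassingly parallel and can simply be spread over all cores.
    #pragma omp parallel for schedule(dynamic)
    for (int ch = 0; ch < nChannels; ch++)
        correlateChannel(ch);
}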
@@ -768,7 +768,7 @@ have even more receivers.
 \end{figure*}
-\subsection{Intel}
+\subsection{Intel CPU}
 We use the SSE4 instruction set to exploit vector parallelism. Due to
 the limited shuffle instructions, computing the correlations of the
@@ -778,14 +778,12 @@ found that, unlike on the other platforms, computing four samples with
 subsequent time stamps in a vector works better. The use of SSE4
 improves the performance by a factor of 3.6 in this case. In
 addition, we use multiple threads to utilize all four cores. To
-benefit from hyperthreading, we need twice as many threads as cores
-(i.e., 8 in our case). Using more threads does not
-help. Hyperthreading increases performance by 6\%. The most efficient
-version uses a tile size of $2 \times 2$. Larger tile sizes are inefficient
-due to the small SSE4 register file. We achieve a performance of 48.0
-gflops, 67\% of the peak, while using 73\% of the peak bandwidth.
+benefit from hyperthreading, we need twice as many threads as cores.
+Hyperthreading increases performance by 6\%. The most efficient
+version uses a tile size of $2 \times 2$. Larger tile sizes are inefficient
+due to the small number of SSE4 registers.
-\subsection{BG/P}
+\subsection{BG/P supercomputer}
 The LOFAR production correlator is implemented on the Blue Gene/P platform.
 We use it as the reference for performance comparisons.
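As an illustration of the vectorization strategy described for the Core i7 (four consecutive time stamps per SSE vector, with the horizontal reduction done afterwards), a hedged C++ sketch; it shows only the real part of a single baseline, and all names are illustrative:

#include <xmmintrin.h>  // SSE intrinsics

// x and y: real samples of two receivers; nTimes is assumed a multiple of 4.
float correlateBaselineSSE(const float *x, const float *y, int nTimes)
{
    __m128 acc = _mm_setzero_ps();
    for (int t = 0; t < nTimes; t += 4) {
        __m128 xs = _mm_loadu_ps(&x[t]);           // 4 consecutive time stamps
        __m128 ys = _mm_loadu_ps(&y[t]);
        acc = _mm_add_ps(acc, _mm_mul_ps(xs, ys)); // no fused multiply-add on this CPU
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);                       // horizontal sum once, at the end
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}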
@@ -793,20 +791,20 @@ The (assembly) code hides load and
 instruction latencies, issues concurrent floating point, integer, and
 load/store instructions, and uses the L2 prefetch buffers in the most
 optimal way.
-We use a cell size of $2 \times 2$, since this offers the highest
-level of reuse, while still fitting in the register file.
+Like on the Intel CPU, we have to use a cell size of $2 \times 2$
+due to the small number of registers.
 The performance we achieve with this version is 13.1 gflops per chip,
 96\% of the theoretical peak performance.
 The problem is compute bound, and not I/O bound, thanks to the
 high memory bandwidth per flop.
-For more information, we refer to~\cite{sc09}.
-%
+For more information, we refer to~\cite{sc09}.
-\subsection{ATI}
+\subsection{ATI GPU}
 ATI offers two separate programming models, at different abstraction
 levels. The low-level programming model is called the ``Compute
 Abstraction Layer'' (CAL). CAL provides communication primitives and
-an intermediate assembly language, allowing fine-tuning of device
+an assembly language, allowing fine-tuning of device
 performance. For high-level programming, ATI adopted \emph{Brook},
 which was originally developed at Stanford~\cite{brook}. ATI's
 extended version is called \emph{Brook+}~\cite{amd-manual}. We
@@ -816,20 +814,14 @@ With both Brook+ and CAL, the programmer has to do the vectorization,
 unlike with NVIDIA GPUs. CAL provides a feature called
 \emph{swizzling}, which is used to select parts of vector registers in
 arithmetic operations. We found this improves readability of the code
-significantly. Unlike the other architectures, the ATI GPUs are not
-well documented. Essential information, such as the number of
-registers, cache sizes, and memory architecture is missing, making it
-hard to write optimized code. Although the situation improved
-recently, the documentation is still inadequate. Moreover, the
-programming tools are insufficient. The high-level Brook+ model does
+significantly. However, the
+programming tools still are insufficient. The high-level Brook+ model does
 not achieve acceptable performance for our application. The low-level
 CAL model does, but it is difficult to use.
 The architecture also does not provide random write access to device
 memory. The kernel output can be written to at most 8 output registers
-(each 4 floats wide). The hardware stores these to predetermined
-locations in device memory. When using the output registers, at most
-32 floating point values can be stored. This effectively limits the
+(each 4 floats wide). This effectively limits the
 tile size to $2 \times 2$. Random write access to \emph{host} memory is
 provided. The correlator reduces the data by a large amount, and the
 results are never reused by the kernel. Therefore, they can be
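A back-of-the-envelope count (not from the paper, but consistent with its numbers and with the dual-polarization complex samples it uses elsewhere) shows why 8 output registers of 4 floats pin the tile size at $2 \times 2$:

\[
\underbrace{2 \times 2}_{\text{baselines}} \times
\underbrace{2 \times 2}_{\text{polarization products}} \times
\underbrace{2}_{\text{real, imaginary}}
= 32~\text{floats}
= 8~\text{registers} \times 4~\text{floats}.
\]

Any larger tile would produce more output than a kernel can store.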
@@ -837,8 +829,8 @@ directly streamed to host memory.
 The best performing implementation uses a tile size of 4x3, thanks to
 the large number of registers. The kernel itself achieves 297 gflops,
-which is 25\% of the theoretical peak performance. The achieved device
-memory bandwidth is 81~GB/s, which is 70\% of the theoretical maximum.
+which is 25\% of the theoretical peak performance. The performance is limited by
+the device memory bandwidth.
 If we also take the host-to-device transfers into account, performance
 becomes much worse. We found that the host-to-device throughput is
@@ -846,14 +838,11 @@ only 4.62 GB/s in practice, although the theoretical PCI-e bus
 bandwidth is 8 GB/s. The transfer can be done asynchronously,
 overlapping the computation with host-to-device communication.
 However, we discovered that the performance of the compute kernel
-decreases significantly if transfers are performed concurrently. For
-the $4 \times 3$ case, the compute kernel becomes 3.0 times slower,
-which can be fully attributed to the decrease of device memory
-throughput. Due to the low I/O performance, we achieve only 171
-gflops, 14\% of the theoretical peak.
+decreases significantly if transfers are performed concurrently.
+Due to the low I/O performance, we achieve only 14\% of the theoretical peak.
-\subsection{NVIDIA}
+\subsection{NVIDIA GPU}
 NVIDIA's programming model is called Cuda~\cite{cuda-manual}.
 Cuda is relatively high-level, and achieves good performance.
@@ -863,8 +852,7 @@ An advantage of NVIDIA hardware and Cuda is that the application does not have to do the
 vectorization. This is thanks to the fact that all cores have their own address generation units.
 All data parallelism is expressed by using threads.
-The correlator uses 128-bit reads to load a complex sample with two
-polarizations with one instruction. Since random write access to
+Since random write access to
 device memory is supported (unlike with the ATI hardware), we can
 simply store the output correlations to device memory. We use the
 texture cache to speed-up access to the sample data. We do not use it for the
@@ -873,7 +861,7 @@ With Cuda, threads
 within a thread block can be synchronized. We exploit this feature to let
 the threads that access the same samples run in lock step. This way,
 we pay a small synchronization overhead, but we can increase the cache hit
-ratio significantly. We found that this optimization improved performance by a factor of 2.0.
+ratio significantly. We found that this optimization improved performance by a factor of 2.
 We also investigated the use of the per-multiprocessor shared memory as an
 application-managed cache. Others report good results with this
@@ -888,26 +876,15 @@ The register file is a shared resource. A smaller tile size means less register
 which allows the use of more concurrent threads, hiding load delays.
 On NVIDIA hardware, we found that the using a relatively small tile size and many threads increases performance.
-The kernel itself, without host-to-device communication achieves 285
-gflops, which is 31\% of the theoretical peak performance. The
-achieved device memory bandwidth is 110~GB/s, which is 108\% of the
-theoretical maximum. We can reach more than 100\% because we include data reuse.
-The performance we get with the correlator is significantly
-improved thanks to this data reuse, which we achieve by exploiting the texture cache.
-The advantage is large, because separate bandwidth tests show that the theoretical
-bandwidth cannot be reached in practice. Even in the most optimal case, only 71\% (72 GB/s) of the
-theoretical maximum can be obtained.
-If we include communication, the performance
-drops by 15\%, and we only get 243 gflops. Just like with the ATI hardware,
-this is caused by the low PCI-e bandwidth.
-With NVIDIA hardware and our data-intensive kernel, we do see significant
-performance gains by using asynchronous I/O. With synchronous I/O, we achieve only
-153 gflops. Therefore, the use of asynchronous I/O is essential.
+The kernel itself, without host-to-device communication achieves 31\%
+of the theoretical peak performance. If we include communication, the
+performance drops to 26\% of the peak. Just like with the ATI
+hardware, this is caused by the low PCI-e bandwidth. With NVIDIA
+hardware and our data-intensive kernel, we do see significant
+performance gains by using asynchronous I/O.
-\subsection{Cell}
+\subsection{Cell/B.E.}
 The basic \mbox{Cell/B.E.} programming is based on multi-threading:
 the PPE spawns threads that execute asynchronously on SPEs.
@@ -921,7 +898,7 @@ transfers~\cite{cell}. The \mbox{Cell/B.E.} can be
 programmed in C or C++, while using intrinsics to exploit vector
 parallelism.
-The large number of registers (128 times 4 floats) allows a big tile size of
+The large number of registers allows a big tile size of
 $4 \times 3$, leading to a lot of data reuse.
 We exploit the vector parallelism of the \mbox{Cell/B.E.} by computing the four
 polarization combinations in parallel. We found that this performs
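A similar rough count (again not from the paper) indicates why the $4 \times 3$ tile fits comfortably in the 128 four-float registers mentioned in the removed text, assuming the four polarization products of each baseline are kept as one real and one imaginary vector:

\[
\underbrace{4 \times 3}_{\text{baselines}} \times
\underbrace{2}_{\text{re/im vectors of the 4 pol.\ products}}
= 24~\text{vector registers for accumulators},
\]

leaving more than a hundred registers for samples and loop bookkeeping.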
@@ -931,16 +908,6 @@ The shuffle instruction is executed
 in the odd pipeline, while the arithmetic is executed in the even
 pipeline, allowing them to overlap.
-We identified a minor performance problem with the pipelines of the
-\mbox{Cell/B.E.}
-Regrettably, there is no (auto)increment instruction in the odd
-pipeline. Therefore, loop counters and address calculations have to
-be performed on the critical path, in the even pipeline. In the time
-it takes to increment a simple loop counter, four multiply-adds, or 8
-flops could have been performed. To circumvent this, we performed loop
-unrolling in our kernels. This solves the performance problem, but has
-the unwanted side effect that it uses local store memory, which is
-better used as data cache.
 A distinctive property of the architecture is that cache transfers are
 explicitly managed by the application, using DMA. This is unlike other
 architectures, where caches work transparently.
@@ -960,11 +927,9 @@ Although issuing explicit DMA commands complicates programming,
 for our application this is not problematic.
 Due to the high
-memory bandwidth and the ability to reuse data, we achieve 187
-gflops, including all memory I/O. This is 92\% of the peak
-performance on one chip. If we use both chips in the cell blade, the
-performance drops only with a small amount, and we still achieve
-91\% (373 gflops) of the peak performance. Even though the memory
+memory bandwidth and the ability to reuse data, we achieve 92\% of the peak
+performance on one chip. If we use both chips in the cell blade, we still achieve
+91\%. Even though the memory
 bandwidth per operation of the \mbox{Cell/B.E.} is eight times lower than
 that of the BG/P, we still achieve excellent performance, thanks to
 the high data reuse factor.
@@ -979,13 +944,13 @@ the high data reuse factor.
 \begin{tabular}{l|l|l|l|l}
 Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\
 \hline
-+ well-known & + L2 prefetch unit works well & + largest number of cores & + random write access & + explicit cache \\
++ well-known & + L2 prefetch unit & + largest number of cores & + random write access & + explicit cache \\
 & + high memory bandwidth & + swizzling support & + Cuda is high-level & + random write access \\
 & & & & + shuffle capabilities \\
 & & & & + power efficiency \\
 & & & & \\
-- few registers & - everything double precision & - low PCI-e bandwidth & - low PCI-e bandwidth & - multiple parallelism levels \\
-- no fma & - expensive & - transfer slows down kernel & & - no increment in odd pipe \\
+- few registers & - double precision only & - low PCI-e bandwidth & - low PCI-e bandwidth & - multiple parallelism levels \\
+- no fma instruction & - expensive & - transfer slows down kernel & & - no increment in odd pipe \\
 - limited shuffling & & - no random write access & & \\
 & & - bad Brook+ performance & & \\
 & & - CAL is low-level & & \\
@@ -1008,9 +973,7 @@ to that of the BG/P. Although the theoretical peak performance of the
 performance is only slightly less. If both chips in the QS21 blade
 are used, the \mbox{Cell/B.E.} also has the highest absolute
 performance. For the GPUs, it is possible to use more than one chip as
-well. This can be done in the form of multiple PCI-e cards, or with
-two chips on a single card, as is done with the ATI 4870x2
-device. However, we found that this does not help, since the
+well, for instance with the ATI 4870x2 device. However, we found that this does not help, since the
 performance is already limited by the low PCI-e throughput, and the
 chips have to share this resource.
 The graph indeed shows that the
@@ -1024,7 +987,7 @@ results are applicable to signal processing applications in
 general.
-\section{Programmability}
+\section{Programmability of the platforms}
 The performance gap between assembly and a high-level programming language
 is quite different for the different platforms. It also
@@ -1034,11 +997,8 @@ etc., up to a level that the C code becomes almost as low-level as assembly
 code. The difference varies between only a few percent to a factor of 10.
 For the BG/P, the performance from compiled C++ code was by far not
-sufficient. The assembly version hides load and instruction
-latencies, issues concurrent floating point, integer, and load/store
-instructions, and uses the L2 prefetch buffers in the most optimal
-way. The resulting code is approximately 10 times faster than C++
-code. For both the Cell/B.E. and the Intel core~i7, we found that
+sufficient. The assembly code is approximately 10 times faster.
+For both the Cell/B.E. and the Intel core~i7, we found that
 high-level code in C or C++ in combination with the use of intrinsics
 to manually describe the SIMD parallelism yields acceptable
 performance compared to optimized assembly code. Thus, the programmer
@@ -1052,6 +1012,7 @@ found that the high-level Brook+ model does not achieve acceptable
 performance compared to hand-written CAL code. Manually written assembly
 is more than three times faster. Also, the Brook+ documentation is insufficient.
+\section{Applying the techniques: a case study with the Intel Larrabee}
 Intel recently disclosed some details about the upcoming Larrabee processor,
@@ -1069,11 +1030,11 @@ One option is to operate on 16~samples with consecutive time stamps.
 A minor drawback is that the data must be ``horizontally'' added to integrate,
 but this can be done outside the main loop.
 Another option is to operate on samples from 16~consecutive frequencies.
-An advantage of this may be that the input is in the right order (i.e.,
-the 16~values can be read from consecutive memory locations) if a Poly-Phase
-Filter precedes the correlator: the FFT outputs consecutive frequencies into
-consecutive memory locations.
-Both
+%% An advantage of this may be that the input is in the right order (i.e.,
+%% the 16~values can be read from consecutive memory locations) if a Poly-Phase
+%% Filter precedes the correlator: the FFT outputs consecutive frequencies into
+%% consecutive memory locations.
+%% Both
 Another option is to correlate samples from different receivers as illustrated
 by Figure~\ref{fig-correlation}.
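A sketch of the first option mentioned above (16 consecutive time stamps per vector, with the ``horizontal'' add deferred until after the main loop); plain C++ with a 16-element array standing in for a Larrabee vector register, which is an assumption for illustration only:

// x, y: real samples of two receivers; nTimes is assumed a multiple of 16.
float correlate16Wide(const float *x, const float *y, int nTimes)
{
    float acc[16] = {0.0f};                // stands in for one 16-wide vector register
    for (int t = 0; t < nTimes; t += 16)
        for (int i = 0; i < 16; i++)
            acc[i] += x[t + i] * y[t + i]; // vertical multiply-add in the main loop
    float sum = 0.0f;
    for (int i = 0; i < 16; i++)           // horizontal add, outside the main loop
        sum += acc[i];
    return sum;
}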