RadioObservatory / LOFAR · Commits
Commit a85ea645, authored 15 years ago by Rob van Nieuwpoort

Bug 1198: performance

Parent: aaa8448c
1 changed file: doc/papers/2010/SPM/spm.tex (81 additions, 3 deletions)
@@ -964,11 +964,89 @@ that of the BG/P, we still achieve excellent performance, thanks to
the high data reuse factor.
\section{Comparison and Evaluation}
\label{sec:perf-compare}
\begin{table*}[t]
\begin{center}
{\small
\begin{tabular}{l|l|l|l|l}
Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\
\hline
+ well-known & + L2 prefetch unit works well & + largest number of cores & + random write access & + explicit cache \\
 & + high memory bandwidth & + swizzling support & + Cuda is high-level & + random write access \\
 & & & & + shuffle capabilities \\
 & & & & + power efficiency \\
 & & & & \\
- few registers & - everything double precision & - low PCI-e bandwidth & - low PCI-e bandwidth & - multiple parallelism levels \\
- no fma & - expensive & - transfer slows down kernel & & - no increment in odd pipe \\
- limited shuffling & & - no random write access & & \\
 & & - bad Brook+ performance & & \\
 & & - CAL is low-level & & \\
 & & - not well documented & & \\
\end{tabular}
}
%\small
\end{center}
\vspace{-0.5cm}
\caption{Strengths and weaknesses of the different platforms for signal processing applications.}
\label{architecture-results-table}
\end{table*}
Figure~\ref{performance-graph} shows the performance on all
architectures we evaluated. The NVIDIA GPU achieves the highest
\emph{absolute} performance. Nevertheless, the GPU \emph{efficiencies}
are much lower than on the other platforms. The \mbox{Cell/B.E.}
achieves the highest efficiency of all many-core architectures, close
to that of the BG/P. Although the theoretical peak performance of the
\mbox{Cell/B.E.} is 4.6 times lower than that of the NVIDIA chip, its
absolute performance is only slightly lower. If both chips in the QS21
blade are used, the \mbox{Cell/B.E.} also has the highest absolute
performance. For the GPUs, it is possible to use more than one chip as
well, either in the form of multiple PCI-e cards or with two chips on
a single card, as is done in the ATI 4870x2 device. However, we found
that this does not help, since performance is already limited by the
low PCI-e throughput, and the chips have to share this resource. The
graph indeed shows that host-to-device I/O has a large impact on GPU
performance, even when using a single chip. With the \mbox{Cell/B.E.},
the I/O (from main memory to the Local Store) has only a very small
impact.

In Table~\ref{architecture-results-table} we summarize the
architectural strengths and weaknesses that we identified. Although
we focus on the correlator application in this paper, the results
apply to applications with low flop/byte ratios in general.
\section{Programmability}
The performance gap between assembly and a high-level programming
language differs considerably between platforms. It also depends on
how much the compiler is helped by manually unrolling loops,
eliminating common subexpressions, using register variables, and so
on, up to the point where the C code becomes almost as low-level as
assembly code. The difference varies from only a few percent to a
factor of ten. For the BG/P, the performance of compiled C++ code was
far from sufficient. The assembly version hides load and instruction
latencies, issues concurrent floating-point, integer, and load/store
instructions, and uses the L2 prefetch buffers optimally. The
resulting code is approximately ten times faster than the C++ code.
For both the Cell/B.E. and the Intel Core~i7, we found that
high-level code in C or C++, combined with intrinsics to manually
describe the SIMD parallelism, yields acceptable performance compared
to optimized assembly code. Thus, the programmer specifies which
instructions have to be used, but can typically leave the instruction
scheduling and register allocation to the compiler.
On NVIDIA hardware, the high-level Cuda model delivers excellent
performance, as long as the programmer helps by using SIMD data types
for loads and stores, and separate local variables for values that
should be kept in registers. With ATI hardware, this is different: we
found that the high-level Brook+ model does not achieve acceptable
performance compared to hand-written CAL code. Manually written
assembly is more than three times faster. Also, the Brook+
documentation is insufficient.
\section{Applying the techniques: a case study with the Intel Larrabee}
Intel recently disclosed some details about the upcoming Larrabee processor,
a fully programmable GPU based on the well-known x86 instruction set.
...
...
@@ -992,7 +1070,7 @@ consecutive memory locations.
Both
Another option is to correlate samples from different receivers, as illustrated
by Figure~\ref{fig-correlation}.
This method minimizes memory loads, but requires additional shuffling of data.
Unfortunately, the most efficient method can only be determined empirically,
when the hardware is available.
...
...