Commit a78b8ea0, authored 15 years ago by Rob van Nieuwpoort
Bug 1198: restored long-version sections to fill the 10 pages
Parent: c5fa1cb3
doc/papers/2010/SPM/spm.tex (50 additions, 51 deletions)
...
...
@@ -604,36 +604,6 @@ optimizing the memory properties of the algorithms is more important
than focusing on reducing the number of compute cycles that is used,
as is traditionally done on systems with only a few or just one core.
\begin{table}[t]
\begin{center}
{\footnotesize
\begin{tabular}{|l|l|l|}
\hline
feature & Cell/B.E. & GPUs \\
\hline
access times & uniform & non-uniform \\
\hline
cache sharing level & single thread (SPE) & all threads in a \\
 & & multiprocessor \\
\hline
access to off-chip mem. & through DMA only & supported \\
\hline
memory access & asynchronous DMA & hardware-managed \\
overlapping & & thread preemption \\
\hline
communication & DMA between SPEs & independent thread \\
 & & blocks \& shared \\
 & & mem. within a block \\
\hline
\end{tabular}
}
%\small
\end{center}
\vspace{-0.5cm}
\caption{Differences between memory architectures.}
\label{memory-properties}
\end{table}
\subsubsection{Well-known memory optimization techniques}
...
...
@@ -738,6 +708,35 @@ same data. For the correlator, the most important insight here
is a technique to exploit data reuse opportunities, reducing the number of memory
loads. We explain this in detail in Section~\ref{sec:tiling}.
\begin{table}[t]
\begin{center}
{\footnotesize
\begin{tabular}{|l|l|l|}
\hline
feature & Cell/B.E. & GPUs \\
\hline
access times & uniform & non-uniform \\
\hline
cache sharing level & single thread (SPE) & all threads in a \\
 & & multiprocessor \\
\hline
access to off-chip mem. & through DMA only & supported \\
\hline
memory access & asynchronous DMA & hardware-managed \\
overlapping & & thread preemption \\
\hline
communication & DMA between SPEs & independent thread \\
 & & blocks \& shared \\
 & & mem. within a block \\
\hline
\end{tabular}
}
%\small
\end{center}
\vspace{-0.5cm}
\caption{Differences between memory architectures.}
\label{memory-properties}
\end{table}
The second phase deals with architecture-specific optimizations.
In this phase, we do not reduce the \emph{number} of memory loads, but think about the
memory \emph{access patterns}. Typically, several cores share one or
...
...
@@ -1056,6 +1055,27 @@ hardware, this is caused by the low PCI-e bandwidth. With NVIDIA
hardware significant performance gains can be achieved by using asynchronous host-to-device I/O.
\begin{table*}[t]
\begin{center}
%{\footnotesize % for normal layout
{\scriptsize % for double spaced
\begin{tabular}{l|l|l|l|l}
Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\
\hline
+ well-known & + L2 prefetch unit & + largest number of cores & + random write access & + power efficiency \\
-- few registers & + high memory bandwidth & + swizzling support & + Cuda is high-level & + random write access \\
-- no fma instruction & + fast interconnects & -- low PCI-e bandwidth & -- low PCI-e bandwidth & + shuffle capabilities \\
-- limited shuffling & -- double precision only & -- transfer slows down kernel & & + explicit cache (performance) \\
 & -- expensive & -- no random write access & & -- explicit cache (programmability) \\
 & & -- bad programming support & & -- multiple parallelism levels \\
\end{tabular}
}
%\small
\end{center}
\vspace{-0.5cm}
\caption{Strengths and weaknesses of the different platforms for signal-processing applications.}
\label{architecture-results-table}
\end{table*}
\noindent \\
\emph{The Cell Broadband Engine}
\noindent
With the
...
...
@@ -1104,27 +1124,6 @@ the high data reuse factor.
\subsection{Comparison and Evaluation}
\label{sec:perf-compare}
\begin{table*}[t]
\begin{center}
%{\footnotesize % for normal layout
{\scriptsize % for double spaced
\begin{tabular}{l|l|l|l|l}
Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\
\hline
+ well-known & + L2 prefetch unit & + largest number of cores & + random write access & + power efficiency \\
-- few registers & + high memory bandwidth & + swizzling support & + Cuda is high-level & + random write access \\
-- no fma instruction & + fast interconnects & -- low PCI-e bandwidth & -- low PCI-e bandwidth & + shuffle capabilities \\
-- limited shuffling & -- double precision only & -- transfer slows down kernel & & + explicit cache (performance) \\
 & -- expensive & -- no random write access & & -- explicit cache (programmability) \\
 & & -- bad programming support & & -- multiple parallelism levels \\
\end{tabular}
}
%\small
\end{center}
\vspace{-0.5cm}
\caption{Strengths and weaknesses of the different platforms for signal-processing applications.}
\label{architecture-results-table}
\end{table*}
Figure~\ref{performance-graph} shows the performance on all
architectures we evaluated. The NVIDIA GPU achieves the highest
\emph{absolute} performance. Nevertheless, the GPU \emph{efficiencies}
...
...