Commit f64c91ae, authored 15 years ago by Rob van Nieuwpoort
Bug 1198: references
parent dce50220
Showing 1 changed file: doc/papers/2010/SPM/spm.tex, with 43 additions and 43 deletions.
@@ -593,12 +593,12 @@ strategies.
 On many-core architectures, the memory bandwidth is shared between the
 cores. This has shifted the balance between between computational and
-memory performance. The available memory bandwidth per operation has
+memory performance. The available memory bandwidth \emph{per operation} has
 decreased dramatically compared to traditional processors. For the
 many-core architectures we use here, the theoretical bandwidth per operation is
 3--10 times lower than on the BG/P, for instance. In practice, if algorithms
 are not optimized well for many-core platforms, the achieved memory bandwidth can
-easily be a hundred times lower than the theoretical maximum.
+easily be ten to a hundred times lower than the theoretical maximum.
 Therefore, we must
 treat memory bandwidth as a scarce resource, and it is important to
 minimize the number of memory accesses. In fact, one of the most
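To make the notion of bandwidth \emph{per operation} concrete (illustrative figures only, not numbers from the paper): an accelerator that sustains $10^{12}$ operations per second while its memory delivers 100~GB/s offers
\[
\frac{100 \times 10^{9}~\mathrm{bytes/s}}{10^{12}~\mathrm{op/s}} = 0.1~\mathrm{bytes\ per\ operation},
\]
so an algorithm has to perform roughly $4 / 0.1 = 40$ operations for every 4-byte value it reads from memory before it becomes compute-bound rather than bandwidth-bound.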
@@ -640,53 +640,53 @@ communication & DMA between SPEs & independent thread
 \subsubsection{Well-known memory optimization techniques}
-The insight that optimizing the interaction with the memory system is
-becoming more and more important is not new. The book by Catthoor et
-al.~\cite{data-access} is an excellent starting point for more
-information on memory-system related
-optimizations. The authors focus on multimedia applications, but the
-techniques described there are also applicable to the field of signal
-processing, which has many similarities to multimedia.
-We can make a distinction between hardware and software memory optimization techniques.
-The software techniques can be divided further into compiler optimizations and
-algorithmic improvements.
-Examples of hardware-based techniques include caching, data
-prefetching and pipelining. The distinction between hardware and
-software is not entirely black and white. Data prefetching, for instance,
-can be done both in harware and software. Another good example is the
-explicit cache of the Cell/B.E. processor. This can be seen as an
-architecture where the programmer handles the cache replacement
-policies instead of the hardware.
+The insight that optimizing the interaction with the memory system is
+becoming more and more important is not new. The book by Catthoor et
+al.~\cite{data-access} is an excellent starting point for more
+information on memory-system related optimizations.
+%The authors focus
+%on multimedia applications, but the techniques described there are
+%also applicable to the field of signal processing, which has many
+%similarities to multimedia.
+We can make a distinction between hardware and software memory
+optimization techniques. Examples of hardware-based techniques include caching, data
+prefetching and pipelining. The software techniques can be divided
+further into compiler optimizations and algorithmic improvements.
+The distinction between hardware and
+software is not entirely black and white. Data prefetching, for
+instance, can be done both in harware and software. Another good
+example is the explicit cache of the \mbox{Cell/B.E.} processor.
+It can be seen as an architecture where the programmer handles the cache
+replacement policies instead of the hardware.
 Many optimizations focus on utilizing data caches more efficiently.
 Hardware cache hierarchies are in principle transparent for the
 application. Nevertheless, it is important to take the sizes of the
-different cache levels into account when optimizing an algorithm.
-A cache line is the smallest unit of memory than can be transferred
-between the main memory and the cache.
-Code can be optimized for the size of the cache lines of a particular architecture.
-Moreover, the associativity of
-the cache can be important. If a cache is N-way set associative, this
-means that any particular location in memory can be cached in either
-of N locations in the data cache. Algorithms can be designed such that
-they take care that cache lines that are needed later are not
-replaced prematurely. Finally, prefetching can be used to load data into caches
-or registers ahead of time.
+different cache levels into account when optimizing an algorithm. A
+cache line is the smallest unit of memory than can be transferred
+between the main memory and the cache. Code can be optimized for the
+cache line size of a particular architecture. Moreover, the
+associativity of the cache can be important. If a cache is N-way set
+associative, this means that any particular location in memory can be
+cached in either of N locations in the data cache. Algorithms can be
+designed such that they take care that cache lines that are needed
+later are not replaced prematurely. Finally, prefetching can be used
+to load data into caches or registers ahead of time.
 Many cache-related optimization techniques have been described in the
 literature, both in the context of hardware and software. For instance,
 an efficient implementation of hardware-based prefetching is described
 in~\cite{Chen95effectivehardware-based}. As we will describe in
-Section~\ref{sec:optimizing}, we implemented prefetching manually in
-the code, for example by using multi-buffering on the Cell/B.E., or by explicitly
+Section~\ref{sec:optimizing}, we implemented prefetching manually in
+software, for example by using multi-buffering on the \mbox{Cell/B.E.}, or by explicitly
 loading data into shared memory or registers on the GPUs. A good
 starting point for cache-aware or cache-oblivious algorithms
 is~\cite{cache}. Another good example of a technique that we used to
-improve cache efficiencies for the correlator is the padding of arrays
+improve cache efficiencies for the correlator is the padding of
+multi-dimensional arrays
 with extra ``dummy'' data elements. This way, we can make sure that
-cache replacement policies work well. This well-known technique
+cache replacement policies work well, and subsequent elements in an array dimension are not mapped
+onto the same cache location. This well-known technique
 is described, for instance, by Bacon et al.~\cite{cache-tlb-compiler}.
 Many additional data access patterns optimization techniques are described
 in~\cite{data-access}.
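The techniques in the hunk above lend themselves to short illustrations. First, the idea of an explicit, application-managed cache (the Cell/B.E. local store, or shared memory on a GPU), as a minimal CUDA sketch; the kernel, its names, and the tap count are hypothetical and not taken from the LOFAR correlator. Each block copies a small set of filter weights into shared memory once, and all of its threads reuse them instead of re-reading them from device memory for every output sample.

#define NTAPS 16

// Hypothetical FIR filter: shared memory acts as an application-managed
// cache for the filter weights, which every thread in the block reuses.
__global__ void fir(const float *in, const float *weights, float *out, int n)
{
    __shared__ float w[NTAPS];                    // explicitly managed "cache"

    // One load per weight per block, instead of one per output sample.
    for (int t = threadIdx.x; t < NTAPS; t += blockDim.x)
        w[t] = weights[t];
    __syncthreads();                              // weights are now resident on chip

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > n - NTAPS)
        return;                                   // skip the tail without a full window

    float sum = 0.0f;
    for (int t = 0; t < NTAPS; t++)
        sum += w[t] * in[i + t];
    out[i] = sum;
}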
...
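Second, prefetching into registers ahead of time, again as a hypothetical CUDA loop rather than correlator code: the load for the next iteration is issued before the current element is used, so the memory latency can overlap with the computation.

// Hypothetical grid-stride loop with software prefetching: the element for
// the next iteration is loaded into a register before the current one is used.
__global__ void scale(const float *in, float *out, float a, int n)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float cur = in[i];                            // first load
    for (int j = i; j < n; j += stride) {
        int next = j + stride;
        float pre = (next < n) ? in[next] : 0.0f; // prefetch the next element
        out[j] = a * cur;                         // work on the current element
        cur = pre;                                // rotate the registers
    }
}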
...
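Third, the padding technique: a standard transpose-style kernel (hypothetical, not from the paper) pads each row of its shared-memory tile with one dummy element. This is the on-chip analogue of padding the leading dimension of an off-chip array, so that consecutive elements of one dimension do not map onto the same cache location or memory bank.

#define TILE 32

// Hypothetical transpose kernel, launched with 32x32 thread blocks.
__global__ void transpose(const float *in, float *out, int width, int height)
{
    // The "+ 1" pads every row with a dummy element, so that a column of the
    // tile is spread over different shared-memory banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // The transposed write reads a tile column; thanks to the padding these
    // reads hit different banks instead of serializing on one.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}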
@@ -703,13 +703,13 @@ placement of variables, objects and arrays in
 memory~\cite{Panda96memorydata}.
 The memory systems of the many-core architectures are quite
-complex. On the GPUs, for instance, several levels of texture cache
-are available, in addition to local memory, and application-managed
-shared memory in several banks than cannot be accessed
-concurrently. There also are complex interactions between the memory
-system and the hardware thread scheduler that tries to overlap memory
-latencies. GPUs literally run tens of thousands of parallel threads
-to keep all functional units fully occupied. We apply the techniques described
+complex. GPUs, for instance, have banked device memory, several
+levels of texture cache, in addition to local memory, and
+application-managed shared memory (also divided over several banks).
+There also are complex interactions between the memory
+system and the hardware thread scheduler.
+GPUs literally run tens of thousands of parallel threads
+to overlap memory latencies, trying
+to keep all functional units fully occupied. We apply the techniques described
 above in software by hand, since we found that the current compilers
 for the many-core architectures do not (yet) implement them well on
 their complex memory systems.
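As a last sketch, the multi-buffering mentioned earlier, written here as a hypothetical CUDA kernel with two shared-memory buffers (the actual Cell/B.E. and GPU implementations in the paper are more involved): while a block works on the chunk held in one buffer, the load of the next chunk into the other buffer is already in flight.

#define CHUNK 256

// Hypothetical double-buffered kernel. Toy computation: each thread
// accumulates the product of its sample and its neighbour's (wrapping within
// the chunk). Launch with a single block of CHUNK threads over
// nchunks * CHUNK input elements; *out must be zero-initialised.
__global__ void chunked_sum(const float *in, float *out, int nchunks)
{
    __shared__ float buf[2][CHUNK];
    int tid = threadIdx.x;
    float acc = 0.0f;

    buf[0][tid] = in[tid];                        // preload chunk 0
    __syncthreads();

    for (int s = 0; s < nchunks; s++) {
        int cur = s & 1;
        if (s + 1 < nchunks)                      // start fetching the next chunk
            buf[1 - cur][tid] = in[(s + 1) * CHUNK + tid];

        // Work on the current chunk while the next load is in flight.
        acc += buf[cur][tid] * buf[cur][(tid + 1) % CHUNK];

        __syncthreads();                          // next chunk is now complete
    }
    atomicAdd(out, acc);
}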