Commit f64c91ae
authored 15 years ago by Rob van Nieuwpoort
Bug 1198: references
parent dce50220
Showing 1 changed file: doc/papers/2010/SPM/spm.tex, with 43 additions and 43 deletions
@@ -593,12 +593,12 @@ strategies.
 On many-core architectures, the memory bandwidth is shared between the
 cores. This has shifted the balance between computational and
-memory performance. The available memory bandwidth per operation has
+memory performance. The available memory bandwidth \emph{per operation} has
 decreased dramatically compared to traditional processors. For the
 many-core architectures we use here, the theoretical bandwidth per operation is
 3--10 times lower than on the BG/P, for instance. In practice, if algorithms
 are not optimized well for many-core platforms, the achieved memory bandwidth can
-easily be a hundred times lower than the theoretical maximum.
+easily be ten to a hundred times lower than the theoretical maximum.
 Therefore, we must
 treat memory bandwidth as a scarce resource, and it is important to
 minimize the number of memory accesses. In fact, one of the most
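The bandwidth-per-operation balance in this hunk can be made concrete as the ratio of peak memory bandwidth to peak compute rate. A small worked example in LaTeX, using commonly quoted peak figures for a BG/P node and a GPU of that generation; the numbers are assumptions for illustration, not taken from spm.tex:

% Illustrative arithmetic (assumed peak figures, not from the paper):
% BG/P node: ~13.6 GB/s memory bandwidth, ~13.6 Gflop/s peak
% GTX 280 GPU: ~141 GB/s memory bandwidth, ~933 Gflop/s peak
\[
  r = \frac{B_{\mathrm{mem}}}{P_{\mathrm{peak}}}, \qquad
  r_{\mathrm{BG/P}} = \frac{13.6~\mathrm{GB/s}}{13.6~\mathrm{Gflop/s}}
    = 1.0~\mathrm{byte/flop}, \qquad
  r_{\mathrm{GPU}} \approx \frac{141~\mathrm{GB/s}}{933~\mathrm{Gflop/s}}
    \approx 0.15~\mathrm{byte/flop}
\]

Under these assumed figures the GPU offers roughly seven times fewer bytes per operation than the BG/P, consistent with the 3--10 times range quoted above.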
@@ -640,53 +640,53 @@ communication & DMA between SPEs & independent thread
 \subsubsection{Well-known memory optimization techniques}

 The insight that optimizing the interaction with the memory system is
 becoming more and more important
 is not new. The book by Catthoor et
 al.~\cite{data-access} is an
 excellent starting point for more
-information on memory-system related
-optimizations. The authors focus on multimedia applications, but the
-techniques described there are also applicable to the field of signal
-processing, which has many similarities to multimedia.
-We can
-make a distinction between hardware and software memory optimization techniques.
-The software techniques can be divided further into compiler optimizations and
-algorithmic improvements.
-Examples of hardware-based techniques include caching, data
-prefetching and pipelining. The distinction between hardware and
-software is
-not entirely black and white. Data prefetching, for instance,
-can be done both in harware and software.
-Another good example is the
-explicit cache of the Cell/B.E. processor.
-This can be
-seen as an
-architecture where the programmer handles the cache
-replacement
-policies instead of the hardware.
+information on memory-system related optimizations.
+%The authors focus
+%on multimedia applications, but the techniques described there are
+%also applicable to the field of signal processing, which has many
+%similarities to multimedia.
+We can make a distinction between hardware and software memory
+optimization techniques. Examples of hardware-based techniques include caching, data
+prefetching and pipelining. The software techniques can be divided
+further into compiler optimizations and algorithmic improvements.
+The distinction between hardware and
+software is not entirely black and white. Data prefetching, for
+instance, can be done both in hardware and software. Another good
+example is the explicit cache of the \mbox{Cell/B.E.} processor. It
+can be seen as an architecture where the programmer handles the cache
+replacement policies instead of the hardware.

 Many optimizations focus on utilizing data caches more efficiently.
 Hardware cache hierarchies are in principle transparent for the
 application. Nevertheless, it is important to take the sizes of the
-different cache levels into account when optimizing an algorithm.
-A cache line is the smallest unit of memory than can be transferred
-between the main memory and the cache.
-Code can be optimized for the size of the cache lines of a particular architecture.
-Moreover, the associativity of
-the cache can be important. If a cache is N-way set associative, this
-means that any particular location in memory can be cached in either
-of N locations in the data cache. Algorithms can be designed such that
-they take care that cache lines that are needed later are not
-replaced prematurely. Finally, prefetching can be used to load data into caches
-or registers ahead of time.
+different cache levels into account when optimizing an algorithm. A
+cache line is the smallest unit of memory that can be transferred
+between the main memory and the cache. Code can be optimized for the
+cache line size of a particular architecture. Moreover, the
+associativity of the cache can be important. If a cache is N-way set
+associative, this means that any particular location in memory can be
+cached in either of N locations in the data cache. Algorithms can be
+designed such that they take care that cache lines that are needed
+later are not replaced prematurely. Finally, prefetching can be used
+to load data into caches or registers ahead of time.
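To illustrate the cache-line point in the paragraph above: traversing a row-major array along its contiguous dimension touches each cache line once, whereas a strided traversal can fetch a whole line for every element. A minimal host-side sketch in CUDA C++ (hypothetical array size, not code from the paper):

#define N 4096

static float a[N][N];   /* row-major: a[i][j] and a[i][j+1] are adjacent */

/* Cache-friendly: the inner loop walks consecutive addresses, so a
   64-byte cache line serves 16 consecutive floats. */
float sum_rowwise(void)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Cache-hostile: consecutive inner-loop accesses are N * 4 bytes apart,
   so each access may bring in (and later evict) a whole new line. */
float sum_columnwise(void)
{
    float sum = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

Both loops perform identical arithmetic; only the memory access order differs, which is exactly the kind of algorithmic memory optimization the text describes.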
 Many cache-related optimization techniques have been described in the
 literature, both in the context of hardware and software. For instance,
 an efficient implementation of hardware-based prefetching is described
 in~\cite{Chen95effectivehardware-based}. As we will describe in
 Section~\ref{sec:optimizing}, we implemented prefetching manually in
-the code, for
-example by using multi-buffering on the Cell/B.E., or by explicitly
+software, for
+example by using multi-buffering on the \mbox{Cell/B.E.}, or by explicitly
 loading data into shared memory or registers on the GPUs. A good
 starting point for cache-aware or cache-oblivious algorithms
 is~\cite{cache}. Another good example of a technique that we used to
-improve cache efficiencies for the correlator is the padding of arrays
+improve cache efficiencies for the correlator is the padding of
+multi-dimensional arrays
 with extra ``dummy'' data elements. This way, we can make sure that
-cache replacement policies work well. This well-known technique
+cache replacement policies work well, and subsequent elements in an
+array dimension are not mapped onto the same cache location. This
+well-known technique
 is described, for instance, by Bacon et al.~\cite{cache-tlb-compiler}.
 Many additional data-access-pattern optimization techniques are described
 in~\cite{data-access}.
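The padding technique in the hunk above has a direct analogue on GPUs, whose shared memory is divided over banks much as a set-associative cache is divided over sets: one dummy column keeps threads that walk a column from all hitting the same bank. A sketch of this well-known padded-tile pattern, assuming 32 banks and a 32x32 tile (a generic transpose kernel, not the correlator code):

#define TILE 32

// Tiled transpose of an n x n row-major matrix. Without the "+ 1"
// dummy column, the tile[threadIdx.x][...] reads below would all have
// stride 32 and land in the same shared-memory bank, serializing the
// accesses; the padding shifts each row by one bank.
__global__ void transpose(float *out, const float *in, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // padded with a dummy column

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();                         // whole tile is loaded

    x = blockIdx.y * TILE + threadIdx.x;     // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

The extra column costs one float of dummy storage per tile row, in exchange for conflict-free accesses in both the row-wise and column-wise phases.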
@@ -703,13 +703,13 @@ placement of variables, objects and arrays in
 memory~\cite{Panda96memorydata}.

 The memory systems of the many-core architectures are quite
-complex. On the GPUs, for instance, several levels of texture cache
-are available,
+complex. GPUs, for instance, have banked device memory,
+several levels of texture cache,
 in addition to local memory, and application-managed
-shared memory in several banks than cannot be accessed
-concurrently.
+shared memory (also divided over several banks).
 There also are complex interactions between the memory
-system and the hardware thread scheduler that tries to overlap memory
-latencies.
+system and the hardware thread scheduler.
 GPUs literally run tens of thousands of parallel threads
-to keep all functional units fully occupied. We apply the techniques described
+to overlap memory latencies, trying
+to keep all functional units fully occupied. We apply the techniques described
 above in software by hand, since we found that the current compilers
 for the many-core architectures do not (yet) implement them well on
 their complex memory systems.
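As an illustration of explicitly loading data into shared memory ahead of time, the kernel below double-buffers its input tiles, the GPU analogue of the multi-buffering DMA scheme mentioned for the Cell/B.E. It is a hypothetical summation kernel sketching the technique, not code from the LOFAR correlator:

#define TILE 256   // one element per thread, TILE threads per block

__global__ void tiled_sum(float *partial, const float *in, int ntiles)
{
    __shared__ float buf[2][TILE];   // two buffers: compute on one,
                                     // prefetch into the other
    int t = threadIdx.x;
    long base = (long)blockIdx.x * ntiles * TILE;
    float acc = 0.0f;

    buf[0][t] = in[base + t];        // prefetch tile 0

    for (int i = 0; i < ntiles; i++) {
        int cur = i & 1;
        __syncthreads();             // tile i has fully arrived

        if (i + 1 < ntiles)          // issue loads for tile i+1 while
            buf[cur ^ 1][t] =        // tile i is consumed below
                in[base + (long)(i + 1) * TILE + t];

        acc += buf[cur][t];          // "compute" on tile i
    }

    partial[blockIdx.x * TILE + t] = acc;   // per-thread partial sums
}

While one warp waits for its global-memory load of tile i+1, other warps proceed with the additions on tile i, so the memory latency is hidden behind computation, which is exactly the overlap the paragraph above attributes to the hardware thread scheduler.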