From 77d791f18b0c4428d17223b65b1052a5bbb64611 Mon Sep 17 00:00:00 2001
From: Rob van Nieuwpoort <nieuwpoort@astron.nl>
Date: Tue, 29 Sep 2009 13:39:43 +0000
Subject: [PATCH] Bug 1198: working on mem opt text

---
 doc/papers/2010/SPM/spm.tex | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/doc/papers/2010/SPM/spm.tex b/doc/papers/2010/SPM/spm.tex
index 63c38318c5c..708ca430765 100644
--- a/doc/papers/2010/SPM/spm.tex
+++ b/doc/papers/2010/SPM/spm.tex
@@ -703,25 +703,22 @@ the parameters and sizes of the caches, by carefully choosing the
 placement of variables, objects and arrays in
 memory~\cite{Panda96memorydata}.
 
+The memory systems of the many-core architectures are quite
+complex. On the GPUs, for instance, several levels of texture cache
+are available, in addition to local memory, and application-managed
+shared memory divided into several banks that cannot be accessed
+concurrently. There are also complex interactions between the memory
+system and the hardware thread scheduler that tries to overlap memory
+latencies.  GPUs literally run tens of thousands of parallel threads
+to keep all functional units fully occupied.  We apply the techniques described
+above in software by hand, since we found that the current compilers
+for the many-core architectures do not (yet) implement them well on
+their complex memory systems.
 
 
-\subsubsection{Appying the techniques}
-
-
-wij doen het met de hand, compilers nog niet goed genoeg. OpenCL helpt misschien, door runtime compilation.
-
-gpus:
+% OpenCL may help, through runtime compilation.
 
-massively multi-threaded
-coalescing (different threads read/write subsequent addresses)
-order of magnitute more impact than on normal shared mem machines (10-100x slowdown if coalescing not right)
-banking of shared memory
-texture cache
-SMT to overlap delays
-Alles nodig voor goede performance, compilers doen dit (nog) niet.
-
-
-literally tens of thousends of parallel threads to keep all functional units busy, and to overlap mem latencies.
+\subsubsection{Applying the techniques}
 
 So, the second step of mapping a signal-processing algorithm to a many-core architecture
 is optimizing the memory behavior. We can split this step into two phases:
-- 
GitLab