diff --git a/doc/papers/2010/SPM/spm.tex b/doc/papers/2010/SPM/spm.tex
index bfb67a7786e97d30389443aca129fcb015cf24b8..153c436eff69b69ddf711465fbae454094204cf6 100644
--- a/doc/papers/2010/SPM/spm.tex
+++ b/doc/papers/2010/SPM/spm.tex
@@ -418,8 +418,8 @@ registers per core x register width & 16x4 & 64x2 &
 total device RAM bandwidth (GB/s) & n.a. & n.a. & 115.2 & 102 & n.a. \\
 \textbf{total host RAM bandwidth (GB/s)} & \textbf{25.6} & \textbf{13.6} & \textbf{8.0} & \textbf{8.0} & \textbf{25.8} \\
 %\hline
-Process Technology (nm) & 45 & 90 & 55 & 65 & 65 \\
-TDP (W) & 130 & 24 & 160 & 236 & 70 \\
+%Process Technology (nm) & 45 & 90 & 55 & 65 & 65 \\
+%TDP (W) & 130 & 24 & 160 & 236 & 70 \\
 %\textbf{gflops / Watt (based on TDP)} & \textbf{0.65} & \textbf{0.57} & \textbf{7.50} & \textbf{3.97} & \textbf{2.93} \\
 %\hline
 %\textbf{gflops/device bandwidth (gflops / GB/s)}& n.a. & n.a. & \textbf{10.4} & \textbf{9.2} & n.a. \\
@@ -460,7 +460,7 @@ TDP (W) & 130 & 24 &
 %% \end{center}
 %% \end{table*}
 
-In this section, we briefly explain key properties of six different
+In this section, we briefly explain key properties of five different
 architectures with multiple cores. We focus on the differences
 between the systems that are relevant for signal processing
 applications. Table~\ref{architecture-properties} shows the most
@@ -478,12 +478,12 @@ precision. The parallelism comes from four cores with two-way
 hyperthreading, and a vector length of four floats, provided by the
 SSE4 instruction set.
 
-SSE4 does not provide fused multiply-add instructions, but the Core~i7
-issues vector-multiply and vector-add instructions concurrently in
-different pipelines, allowing eight flops per cycle per core. One
-problem of SSE4 that complicates an efficient correlator is the
-limited support for shuffling data within vector registers, unlike the
-Cell/B.E., for instance, that can shuffle any byte to any position.
+%% SSE4 does not provide fused multiply-add instructions, but the Core~i7
+%% issues vector-multiply and vector-add instructions concurrently in
+%% different pipelines, allowing eight flops per cycle per core.
+A problem with SSE4 is its
+limited support for shuffling data within vector registers. This is unlike the
+Cell/B.E. and ATI GPUs, which can shuffle values in all possible combinations.
 Also, the number of vector registers is small (sixteen four-word
-registers). Therefore, the is not much opportunity to reuse data in
+registers). Therefore, there is not much opportunity to reuse data in
 registers; reuse has to come from the L1~data cache.
@@ -506,15 +506,15 @@ We found that the BG/P is extremely suitable for our application,
 since it is highly optimized for processing of complex numbers. The
 BG/P performs \emph{all} floating point operations in double
 precision, which is overkill for our application.
-In contrast to all other architectures we evaluate, the problem is compute
-bound instead of I/O bound, thanks to the BG/P's high memory bandwidth per
-operation, which is 3--10 times higher than for the other architectures.
 The BG/P has 32 vector registers of width 2. Therefore, 64 floating
 point numbers can be kept in registers simultaneously. Although this
-is the same amount as on the general purpose Intel chip, an important
-difference is that the BG/P has 32 registers of width 2, compared to
+is the same amount as on the general purpose Intel chip, an important
+difference is the register shape: 32 registers of width 2, compared to
 Intel's 16 of width 4. The smaller vector size reduces the amount of
 shuffle instructions needed.
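+As an illustration, a complex multiply-accumulate decomposes into four
+real fused multiply-adds on the real and imaginary parts. The sketch
+below shows the arithmetic pattern in plain~C; it is an illustration
+only, not BG/P intrinsics, and the mapping onto the two-way FPU is
+left to the compiler.
+\begin{verbatim}
+typedef struct { double re, im; } dcomplex;
+
+/* illustrative sketch, not BG/P intrinsics:
+   a += b * c, as four real multiply-adds */
+static dcomplex cmadd(dcomplex a, dcomplex b,
+                      dcomplex c)
+{
+    a.re += b.re * c.re;
+    a.re -= b.im * c.im;
+    a.im += b.re * c.im;
+    a.im += b.im * c.re;
+    return a;
+}
+\end{verbatim}
+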
+In contrast to all other architectures we evaluate, the problem is compute
+bound rather than I/O bound, thanks to the BG/P's high memory bandwidth per
+operation, which is 3--10 times higher than on the other architectures.
 
 \subsection{ATI GPU}
 
@@ -524,9 +524,10 @@ the 4870~\cite{amd-manual}. The chip contains 160 cores, with 800
 FPUs in total and has a theoretical peak performance of
 1.2~teraflops. The board uses a PCI-express~2.0 interface for
 communication with the host system.
-The application can specify if a read should be
-cached by the texture cache or not, while the streaming processors have 16 KB of shared
-memory that is completely managed by the application.
+The streaming processors have 16 KB of shared
+memory that is completely managed by the application. It is also
+possible to specify whether a read should be
+cached by the texture cache or not.
 
 The ATI 4870 GPU has the largest number of FPUs of all architectures
 we evaluate. However, the architecture has several important
@@ -546,7 +547,7 @@ read-performance bound, this does not have a large impact.
 \subsection{NVIDIA GPU}
 
 NVIDIA's Tesla C1060 contains a GTX~280 GPU with 240 single precision
-and 30 double precision FPUs. The GTX~280 uses a two-level hierarchy to group cores.
+and 30 double precision FPUs~\cite{cuda-manual}. The GTX~280 uses a two-level hierarchy to group cores.
 There are 30~independent \emph{multiprocessors\/} that each have
 8~cores. Current NVIDIA GPUs have fewer cores than ATI GPUs, but the
 individual cores are faster.
@@ -581,13 +582,12 @@ heterogeneous many-core processor, designed by Sony, Toshiba and IBM
 Element (PPE), acting as a main processor, and eight Synergistic
 Processing Elements (SPEs) that provide the real processing power.
 The cores, the main memory, and the external I/O are connected by a
-high-bandwidth Element Interconnection Bus (EIB). The main memory has
-a high-bandwidth, and uses XDR (Rambus). The PPE's main role is to
+high-bandwidth element interconnection bus. The main memory has
+a relatively high bandwidth. The PPE's main role is to
 run the operating system and to coordinate the SPEs. An SPE contains
-a RISC-core (the Synergistic Processing Unit (SPU)), a 256KB Local
-Store (LS), and a memory flow controller.
+a RISC core, a 256~KB Local Store (LS), and a memory flow controller.
 
-The LS is an extremely fast local memory (SRAM) for both code and data
+The LS is an extremely fast local memory for both code and data
 and is managed entirely by the application with explicit DMA
 transfers. The LS can be considered the SPU's L1 cache.  The
 \mbox{Cell/B.E.} has a large number of registers: each SPU has 128,
@@ -609,15 +609,16 @@ system have a total theoretical single-precision peak performance of
 \begin{table*}[t]
 \begin{center}
 {\small
-\begin{tabular}{l|l|l}
+\begin{tabular}{|l|l|l|}
+\hline
 feature & Cell/B.E. & GPUs \\
 \hline
 access times & uniform & non-uniform \\
 cache sharing level & single thread (SPE) & all threads in a multiprocessor \\
 access to off-chip memory & only through DMA & supported \\
 memory access overlapping & asynchronous DMA & hardware-managed thread preemption \\
-communication & DMA between SPEs & independent thread blocks + \\
- & & shared memory within a block \\
+communication & DMA between SPEs & independent thread blocks + shared memory within a block \\
+\hline
 \end{tabular}
 } %\small
 \end{center}
@@ -631,15 +632,15 @@ processing applications. Explicit support for complex operations is
 preferable, both in terms of programmability and performance.
 If it is not available, we can circumvent this by using separate arrays
 for real values and for imaginary values. Except for the Blue Gene/P (and
-to some extent the Core~i7), none of the architectures do not support
+to some extent the Core~i7), none of the other architectures support
 complex operations.
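+As an illustration of this workaround, the plain-C sketch below
+(hypothetical names, not code from our correlator) multiplies two
+complex arrays stored as separate real and imaginary parts, using
+ordinary floating point operations only:
+\begin{verbatim}
+/* out = x * y; all arrays are split into
+   separate real/imaginary parts */
+void cmul_soa(const float *x_re, const float *x_im,
+              const float *y_re, const float *y_im,
+              float *out_re, float *out_im, int n)
+{
+    for (int i = 0; i < n; i++) {
+        out_re[i] = x_re[i]*y_re[i] - x_im[i]*y_im[i];
+        out_im[i] = x_re[i]*y_im[i] + x_im[i]*y_re[i];
+    }
+}
+\end{verbatim}
+On the NVIDIA GPUs, each thread executes exactly these scalar
+operations; on architectures with vector parallelism, the loop can be
+vectorized over \texttt{i} without any shuffling, since the real and
+imaginary parts are already separated.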
 
-The different architectures require two different approaches of
-dealing with this problem. First, if an architecture does not use
+The different architectures require two different approaches to
+dealing with this problem. If an architecture does not use
 explicit SIMD (vector) parallelism, the complex operations can simply
 be expressed in terms of normal floating point operations. This puts
 an extra burden on the programmer, but achieves good performance. The
-NVIDIA GPUs work this way. Second, if an architecture does use vector
-parallelism, we can either store the real and complex parts inside a
+NVIDIA GPUs work this way. However, if an architecture does use vector
+parallelism, we can either store the real and imaginary parts inside a
 single vector, or have separate vectors for the two parts. In both
 cases, support for shuffling data inside the vector registers is
@@ -652,7 +653,7 @@ GPUs, this works in a similar way. The SSE4 instruction set in the
 Intel core~i7, however, does not support arbitrary shuffling
 patterns. This has a large impact on the way the code is vectorized,
 and requires a different SIMDization strategy. In the case of the
-correlator, this led to suboptimal performance.
+correlator, this results in suboptimal performance.
 
-%% complexe getallen zijn belangrijk voor signal processing.
-%% Niet alle arch ondersteunen dit even goed.
+%% complex numbers are important for signal processing.
+%% not all architectures support this equally well.
@@ -665,33 +666,34 @@ correlator, this led to suboptimal performance.
-%% De ene arch kan dit beter dan de andere.
+%% some architectures handle this better than others.
 
 On many-core architectures, the memory bandwidth is shared between the
-cores. This has shifted the balance between between compute
-operations and memory loads. The available memory bandwidth per
-operation has decreased considerably. For the many-core architecures
+cores. This has shifted the balance between computational
+and memory performance. The available memory bandwidth per
+operation has decreased dramatically. For the many-core architectures
 we use here, the bandwidth per operation is 3--10 times lower than on
 the BG/P, for instance. Therefore, we must treat memory bandwidth as
 a scarce resource, and it is important to minimize the number of
-memory accesses. In fact, we found that on many-core architectures,
-optimizing the memory properties of the algoritms is more important
-than focussing on reducing the number of compute cycles that is used,
+memory accesses. In fact, the most important lesson of this paper is that on many-core architectures,
+optimizing the memory properties of the algorithms is more important
+than focusing on reducing the number of compute cycles used,
 as is traditionally done on systems with only a few or just one core.
 
 Optimizing the memory behavior of an algorithm has two different
-aspects. First, the number of accesses per operation should be
-reduces as much as possible, sometimes even at the cost of more
+aspects. First, the \emph{number} of memory accesses per operation should be
+reduced as much as possible, sometimes even at the cost of more
 compute cycles. Second, it is important to think about the memory
-access patterns. Typically, several cores share one or more cache
+\emph{access patterns}. Typically, several cores share one or more cache
 levels. Therefore, the access patterns of several different threads
 that share a cache should be tailored accordingly. On GPUs, for
 example, this can be done by \emph{coalescing} memory accesses.
 This means that different concurrent threads read subsequent memory
-locations. This can be counter-intuitive, since traditionally, it was
-better to have linear memory access patterns within a thread. In the
+locations. This can be counter-intuitive, since traditionally, it was
+better to have linear memory access patterns within a thread.
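+The CUDA sketch below (hypothetical kernels, for illustration only)
+contrasts the two patterns: in the first, consecutive threads read
+consecutive words, which the hardware coalesces into wide memory
+transactions; in the second, each thread walks its own contiguous
+chunk, which is natural on a CPU but is not coalesced on a GPU.
+\begin{verbatim}
+/* coalesced: thread i touches element i */
+__global__ void copy_coalesced(const float *in,
+                               float *out, int n)
+{
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n)
+        out[i] = in[i];
+}
+
+/* not coalesced: each thread reads its own
+   linear chunk of c elements */
+__global__ void copy_chunked(const float *in,
+                             float *out, int n, int c)
+{
+    int base = (blockIdx.x * blockDim.x
+                + threadIdx.x) * c;
+    for (int j = 0; j < c && base + j < n; j++)
+        out[base + j] = in[base + j];
+}
+\end{verbatim}
+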
+Table~\ref{memory-properties} summarizes the differences between the
+memory architectures of the platforms. In the
 next section, we explain the techniques described above by applying
 them to the correlator application.
 
-\section{Optimizing the correlator}
+\section{Implementing and optimizing the correlator}
 \label{sec:optimizing}
 
 % TODO add text about mapping from alg to arch
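+As a starting point, the sketch below gives a minimal scalar reference
+for correlating one station pair in one frequency channel
+(hypothetical code, assuming the usual multiply-by-conjugate
+definition; names are illustrative):
+\begin{verbatim}
+/* accumulate x1[t] * conj(x2[t]) over time for one
+   station pair and one frequency channel */
+void correlate(const float *x1_re, const float *x1_im,
+               const float *x2_re, const float *x2_im,
+               float *vis_re, float *vis_im,
+               int nr_times)
+{
+    float re = 0.0f, im = 0.0f;
+    for (int t = 0; t < nr_times; t++) {
+        re += x1_re[t]*x2_re[t] + x1_im[t]*x2_im[t];
+        im += x1_im[t]*x2_re[t] - x1_re[t]*x2_im[t];
+    }
+    *vis_re = re;
+    *vis_im = im;
+}
+\end{verbatim}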