Commit a78b8ea0 authored by Rob van Nieuwpoort

Bug 1198: restored the long-version sections to fill the 10 pages

parent c5fa1cb3
@@ -604,36 +604,6 @@ optimizing the memory properties of the algorithms is more important
than focusing on reducing the number of compute cycles used, as is
traditionally done on systems with only one or a few cores.
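To see why, compare a kernel's arithmetic intensity with the machine balance of the hardware; the numbers below are illustrative assumptions, not measurements from the paper.

% Back-of-the-envelope check (illustrative numbers, not from the paper):
\[
I = \frac{\mathrm{flops}}{\mathrm{bytes\ loaded}},
\qquad
B = \frac{\mathrm{peak\ flop/s}}{\mathrm{peak\ bytes/s}}.
\]
% A kernel performing 8 flops per 32-byte load has I = 0.25 flop/byte.
% On a device with an assumed 1 Tflop/s peak and 100 GB/s of memory
% bandwidth, B = 10 flop/byte, so I << B: the kernel is memory-bound,
% and saving loads helps where saving compute cycles does not.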
\begin{table}[t]
\begin{center}
{\footnotesize
\begin{tabular}{|l|l|l|}
\hline
feature & Cell/B.E. & GPUs \\
\hline
access times & uniform & non-uniform \\
\hline
cache sharing level & single thread (SPE) & all threads in a \\
& & multiprocessor \\
\hline
access to off-chip mem. & through DMA only & supported \\
\hline
memory access & asynchronous DMA & hardware-managed \\
overlapping & & thread preemption \\
\hline
communication & DMA between SPEs & independent thread \\
& & blocks \& shared \\
& & mem. within a block \\
\hline
\end{tabular}
} %\small
\end{center}
\vspace{-0.5cm}
\caption{Differences between memory architectures.}
\label{memory-properties}
\end{table}
\subsubsection{Well-known memory optimization techniques}
@@ -738,6 +708,35 @@ same data. For the correlator, the most important insight here
is a technique to exploit data reuse opportunities, reducing the number of memory
loads. We explain this in detail in Section~\ref{sec:tiling}.
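As a rough illustration of such tiling, the CUDA sketch below computes a 2x2 tile of station pairs per thread, so each fetched sample is reused for two baselines; the data layout, tile size, and all names are assumptions made for this sketch, not the paper's actual kernel (which Section~\ref{sec:tiling} describes).

// Illustrative 2x2-tiled correlator-like kernel (a sketch under assumed
// names and layout, not the paper's code). Without tiling, each
// multiply-accumulate needs two sample loads; with a 2x2 tile, four
// loads feed four multiply-accumulates, halving the loads per flop.
#include <cuda_runtime.h>

struct cfloat { float x, y; };          // minimal complex type

__device__ cfloat cmaddConj(cfloat a, cfloat b, cfloat acc)
{
    // acc += a * conj(b)
    acc.x += a.x * b.x + a.y * b.y;
    acc.y += a.y * b.x - a.x * b.y;
    return acc;
}

// samples: nrStations x nrTimes, row-major; nrStations assumed even.
__global__ void correlate2x2(const cfloat *samples, cfloat *vis,
                             int nrStations, int nrTimes)
{
    int sx = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    int sy = 2 * (blockIdx.y * blockDim.y + threadIdx.y);
    if (sx + 1 >= nrStations || sy + 1 >= nrStations || sy < sx)
        return;                         // each station-pair tile computed once

    cfloat v00 = {0, 0}, v01 = {0, 0}, v10 = {0, 0}, v11 = {0, 0};
    for (int t = 0; t < nrTimes; t++) {
        cfloat a0 = samples[(sx    ) * nrTimes + t];   // four loads ...
        cfloat a1 = samples[(sx + 1) * nrTimes + t];
        cfloat b0 = samples[(sy    ) * nrTimes + t];
        cfloat b1 = samples[(sy + 1) * nrTimes + t];
        v00 = cmaddConj(a0, b0, v00);                  // ... reused by
        v01 = cmaddConj(a0, b1, v01);                  // four accumulations
        v10 = cmaddConj(a1, b0, v10);
        v11 = cmaddConj(a1, b1, v11);
    }
    // Simplified output numbering: one 4-visibility slot per 2x2 tile;
    // slots with sy < sx stay unused in this sketch.
    int tile = (sy / 2) * (nrStations / 2) + sx / 2;
    vis[4 * tile + 0] = v00;  vis[4 * tile + 1] = v01;
    vis[4 * tile + 2] = v10;  vis[4 * tile + 3] = v11;
}

Larger tiles reuse each loaded sample even more often, until register pressure becomes the limiting factor.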
\begin{table}[t]
\begin{center}
{\footnotesize
\begin{tabular}{|l|l|l|}
\hline
feature & Cell/B.E. & GPUs \\
\hline
access times & uniform & non-uniform \\
\hline
cache sharing level & single thread (SPE) & all threads in a \\
& & multiprocessor \\
\hline
access to off-chip mem. & through DMA only & supported \\
\hline
memory access & asynchronous DMA & hardware-managed \\
overlapping & & thread preemption \\
\hline
communication & DMA between SPEs & independent thread \\
& & blocks \& shared \\
& & mem. within a block \\
\hline
\end{tabular}
} %\small
\end{center}
\vspace{-0.5cm}
\caption{Differences between memory architectures.}
\label{memory-properties}
\end{table}
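The communication row of this table can be made concrete with a small CUDA example: threads within one block cooperate through on-chip shared memory, while blocks remain independent. The kernel below is an illustrative sketch (a block-wide sum, assuming a block size of 256 threads), not code from the paper.

// Illustrative sketch: intra-block communication via shared memory.
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float buf[256];          // visible to all threads in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // make every write block-visible

    // Tree reduction inside the block; blocks never communicate with
    // each other, so each block emits only its own partial sum.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[0];
}
// Launch with blockDim.x == 256 (a power of two) to match the buffer size.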
The second phase deals with architecture-specific optimizations.
In this phase, we do not reduce the \emph{number} of memory loads, but think about the
memory \emph{access patterns}. Typically, several cores share one or
@@ -1056,6 +1055,27 @@ hardware, this is caused by the low PCI-e bandwidth. With NVIDIA
hardware, significant performance gains can be achieved by using asynchronous host-to-device I/O.
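A minimal sketch of this technique, assuming CUDA streams, page-locked host buffers, and a placeholder kernel (none of these details are from the paper): the input is split into chunks, and the copy of chunk c overlaps with the kernel working on chunk c-1.

// Overlapping host-to-device transfers with kernel execution (sketch).
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // stand-in for the real kernel
}

// host/hostOut must be page-locked (cudaHostAlloc) for true async copies;
// n is assumed to be divisible by nrChunks.
void processOverlapped(const float *host, float *hostOut, int n, int nrChunks)
{
    int chunk = n / nrChunks;
    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < nrChunks; c++) {
        cudaStream_t st = s[c & 1];     // alternate streams to allow overlap
        cudaMemcpyAsync(dIn + c * chunk, host + c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(dIn + c * chunk,
                                                     dOut + c * chunk, chunk);
        cudaMemcpyAsync(hostOut + c * chunk, dOut + c * chunk,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();            // wait for all streams to finish
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(dIn);
    cudaFree(dOut);
}

Note that devices of the C1060 generation have a single copy engine, so at any moment one copy can overlap with one kernel; with pageable host memory, the asynchronous copies fall back to synchronous behaviour.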
\begin{table*}[t]
\begin{center}
%{\footnotesize % for normal layout
{\scriptsize % for double spaced
\begin{tabular}{l|l|l|l|l}
Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\
\hline
+ well-known & + L2 prefetch unit & + largest number of cores & + random write access & + power efficiency \\
-- few registers & + high memory bandwidth & + swizzling support & + CUDA is high-level & + random write access \\
-- no FMA instruction & + fast interconnects & -- low PCI-e bandwidth & -- low PCI-e bandwidth & + shuffle capabilities \\
-- limited shuffling & -- double precision only & -- transfer slows down kernel & & + explicit cache (performance) \\
& -- expensive & -- no random write access & & -- explicit cache (programmability) \\
& & -- poor programming support & & -- multiple parallelism levels \\
\end{tabular}
} %\small
\end{center}
\vspace{-0.5cm}
\caption{Strengths and weaknesses of the different platforms for signal-processing applications.}
\label{architecture-results-table}
\end{table*}
\noindent \\ \emph{The Cell Broadband Engine}
\noindent With the
@@ -1104,27 +1124,6 @@ the high data reuse factor.
\subsection{Comparison and Evaluation}
\label{sec:perf-compare}
\begin{table*}[t]
\begin{center}
%{\footnotesize % for normal layout
{\scriptsize % for double spaced
\begin{tabular}{l|l|l|l|l}
Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\
\hline
+ well-known & + L2 prefetch unit & + largest number of cores & + random write access & + power efficiency \\
-- few registers & + high memory bandwidth & + swizzling support & + CUDA is high-level & + random write access \\
-- no FMA instruction & + fast interconnects & -- low PCI-e bandwidth & -- low PCI-e bandwidth & + shuffle capabilities \\
-- limited shuffling & -- double precision only & -- transfer slows down kernel & & + explicit cache (performance) \\
& -- expensive & -- no random write access & & -- explicit cache (programmability) \\
& & -- poor programming support & & -- multiple parallelism levels \\
\end{tabular}
} %\small
\end{center}
\vspace{-0.5cm}
\caption{Strengths and weaknesses of the different platforms for signal-processing applications.}
\label{architecture-results-table}
\end{table*}
Figure~\ref{performance-graph} shows the performance on all
architectures we evaluated. The NVIDIA GPU achieves the highest
\emph{absolute} performance. Nevertheless, the GPU \emph{efficiencies}
......