From b4206617a44ca2cf5ee802693bd760ae7519293f Mon Sep 17 00:00:00 2001
From: Rob van Nieuwpoort <nieuwpoort@astron.nl>
Date: Tue, 30 Jun 2009 09:15:01 +0000
Subject: [PATCH] Bug 1198: s5-end

---
 doc/papers/2010/SPM/spm.tex | 143 +++++++++++++-----------------------
 1 file changed, 52 insertions(+), 91 deletions(-)

diff --git a/doc/papers/2010/SPM/spm.tex b/doc/papers/2010/SPM/spm.tex
index 153c436eff6..5eb00451439 100644
--- a/doc/papers/2010/SPM/spm.tex
+++ b/doc/papers/2010/SPM/spm.tex
@@ -721,8 +721,8 @@ autocorrelations.
 For example, the samples from receivers 8, 9, 10, and 11 can be
 correlated with the samples from receivers 4, 5, 6, and 7 (the red
 square in the figure), reusing each fetched sample four times.
-This way, eight samples are read from memory for 16
-multiplications, reducing the amount of memory operations already by a factor
+This way, eight samples are read from memory for sixteen
+multiplications, reducing the number of memory operations by a factor
 of four.
 Correlating even higher numbers of receivers simultaneously would
 reduce the memory bandwidth usage further, but the maximum number of receivers that can
@@ -739,24 +739,24 @@ architecture.
 %On the one hand, correlating and integrating over long periods of time
 %is good for pipelined FPU operation, on the other hand, the
-Even when dividing the correlation triangle in tiles there still is
+There is still
 opportunity for additional data reuse \emph{between} tiles.
 The tiles within a row or column in the triangle
 still need the same samples.
 In addition to registers, caches can thus also
 be used to increase data reuse. It is important to realize that the
-correlator itself is \emph{trivially parallel}, since tens of thousands of
-frequency channels can be processed independently. This allows us to
+correlator itself is \emph{trivially parallel}, since the tens of thousands of
+frequency channels that LOFAR uses can be processed independently. This allows us to
 efficiently exploit many-core hardware.

 We will now describe the implementation of the correlator on the
 different architectures. We evaluate the performance in detail.
 For comparison reasons, we use the
 performance \emph{per chip} for each architecture.
-We choose 64 as the number of receivers, since
+We choose 64 as the number of receivers (each consisting of hundreds of antennas), since
 that is a realistic number for LOFAR. Future instruments will likely
-have even more receivers.
+have even more receivers. The performance results are shown in Figure~\ref{performance-graph}.

 \begin{figure*}[t]
 \begin{center}
@@ -768,7 +768,7 @@ have even more receivers.
 \end{figure*}

-\subsection{Intel}
+\subsection{Intel CPU}

 We use the SSE4 instruction set to exploit vector parallelism.
 Due to the limited shuffle instructions, computing the correlations of the
@@ -778,14 +778,12 @@ found that, unlike on the other platforms, computing four samples
 with subsequent time stamps in a vector works better.
 The use of SSE4 improves the performance by a factor of 3.6 in this
 case.  In addition, we use multiple threads to utilize all four cores. To
-benefit from hyperthreading, we need twice as many threads as cores
-(i.e., 8 in our case). Using more threads does not
-help. Hyperthreading increases performance by 6\%. The most efficient
-version uses a tile size of $2 \times 2$\. Larger tile sizes are inefficient
-due to the small SSE4 register file. We achieve a performance of 48.0
-gflops, 67\% of the peak, while using 73\% of the peak bandwidth.
+benefit from hyperthreading, we need twice as many threads as cores.
+Hyperthreading increases performance by 6\%. The most efficient
+version uses a tile size of $2 \times 2$. Larger tile sizes are inefficient
+due to the small number of SSE4 registers.

-\subsection{BG/P}
+\subsection{BG/P supercomputer}

 The LOFAR production correlator is implemented on the Blue Gene/P
 platform. We use it as the reference for performance comparisons.
@@ -793,20 +791,20 @@ The (assembly) code hides load and instruction
 latencies, issues concurrent floating point, integer, and load/store
 instructions, and uses the L2 prefetch buffers in the most optimal
 way.
-We use a cell size of $2 \times 2$, since this offers the highest
-level of reuse, while still fitting in the register file.
+Like on the Intel CPU, we have to use a tile size of $2 \times 2$ due to
+the small number of registers.
 The performance we achieve with this version is 13.1 gflops per chip,
 96\% of the theoretical peak performance. The problem is compute
 bound, and not I/O bound, thanks to the high memory bandwidth per flop.
-For more information, we refer to~\cite{sc09}.
+%For more information, we refer to~\cite{sc09}.

-\subsection{ATI}
+\subsection{ATI GPU}

 ATI offers two separate programming models, at different abstraction
 levels. The low-level programming model is called the ``Compute
 Abstraction Layer'' (CAL). CAL provides communication primitives and
-an intermediate assembly language, allowing fine-tuning of device
+an assembly language, allowing fine-tuning of device
 performance. For high-level programming, ATI adopted \emph{Brook},
 which was originally developed at Stanford~\cite{brook}. ATI's
 extended version is called \emph{Brook+}~\cite{amd-manual}. We
@@ -816,20 +814,14 @@ With both Brook+ and CAL, the programmer has to do the vectorization,
 unlike with NVIDIA GPUs. CAL provides a feature called
 \emph{swizzling}, which is used to select parts of vector registers
 in arithmetic operations. We found this improves readability of the code
-significantly. Unlike the other architectures, the ATI GPUs are not
-well documented. Essential information, such as the number of
-registers, cache sizes, and memory architecture is missing, making it
-hard to write optimized code. Although the situation improved
-recently, the documentation is still inadequate. Moreover, the
-programming tools are insufficient. The high-level Brook+ model does
+significantly. However, the
+programming tools are still insufficient. The high-level Brook+ model does
 not achieve acceptable performance for our application. The low-level
 CAL model does, but it is difficult to use.

 The architecture also does not provide random write access to device
 memory. The kernel output can be written to at most 8 output registers
-(each 4 floats wide). The hardware stores these to predetermined
-locations in device memory. When using the output registers, at most
-32 floating point values can be stored. This effectively limits the
+(each 4 floats wide). This effectively limits the
 tile size to $2\times2$. Random write access to \emph{host} memory is
 provided. The correlator reduces the data by a large amount, and the
 results are never reused by the kernel. Therefore, they can be
@@ -837,8 +829,8 @@ directly streamed to host memory.

 The best performing implementation
 uses a tile size of 4x3, thanks to the large number of registers. The
 kernel itself achieves 297 gflops,
-which is 25\% of the theoretical peak performance. The achieved device
-memory bandwidth is 81~GB/s, which is 70\% of the theoretical maximum.
+which is 25\% of the theoretical peak performance. The performance is limited by
+the device memory bandwidth.

 If we also take the host-to-device transfers into account, performance
 becomes much worse. We found that the host-to-device throughput is
@@ -846,14 +838,11 @@ only 4.62 GB/s in practice, although the theoretical PCI-e bus
 bandwidth is 8 GB/s. The transfer can be done
 asynchronously, overlapping the computation with host-to-device
 communication. However, we discovered that the performance of the compute kernel
-decreases significantly if transfers are performed concurrently. For
-the $4\times3$ case, the compute kernel becomes 3.0 times slower,
-which can be fully attributed to the decrease of device memory
-throughput. Due to the low I/O performance, we achieve only 171
-gflops, 14\% of the theoretical peak.
+decreases significantly if transfers are performed concurrently.
+Due to the low I/O performance, we achieve only 14\% of the theoretical peak.

-\subsection{NVIDIA}
+\subsection{NVIDIA GPU}

 NVIDIA's programming model is called Cuda~\cite{cuda-manual}. Cuda is
 relatively high-level, and achieves good performance.
@@ -863,8 +852,7 @@ An advantage of NVIDIA hardware and Cuda is that the application does not have t
 vectorization. This is thanks to the fact that all cores have their
 own address generation units. All data parallelism is expressed by
 using threads.
-The correlator uses 128-bit reads to load a complex sample with two
-polarizations with one instruction. Since random write access to
+Since random write access to
 device memory is supported (unlike with the ATI hardware), we can simply
 store the output correlations to device memory. We use the texture
 cache to speed up access to the sample data. We do not use it for the
@@ -873,7 +861,7 @@ With Cuda, threads within a thread block can be synchronized. We
 exploit this feature to let the threads that access the same samples
 run in lock step.
 This way, we pay a small synchronization overhead, but we can increase the cache hit
-ratio significantly. We found that this optimization improved performance by a factor of 2.0.
+ratio significantly. We found that this optimization improved performance by a factor of 2.

 We also investigated the use of the per-multiprocessor shared memory
 as an application-managed cache. Others report good results with this
@@ -888,26 +876,15 @@ The register file is a shared resource. A smaller tile size means less register
 which allows the use of more concurrent threads, hiding load delays.
 On NVIDIA hardware, we found that using a relatively small tile size and
 many threads increases performance.
-The kernel itself, without host-to-device communication achieves 285
-gflops, which is 31\% of the theoretical peak performance. The
-achieved device memory bandwidth is 110~GB/s, which is 108\% of the
-theoretical maximum. We can reach more than 100\% because we include data reuse.
-The performance we get with the correlator is significantly
-improved thanks to this data reuse, which we achieve by exploiting the texture cache.
-The advantage is large, because separate bandwidth tests show that the theoretical
-bandwidth cannot be reached in practice. Even in the most optimal case, only 71\% (72 GB/s) of the
-theoretical maximum can be obtained.

-If we include communication, the performance
-drops by 15\%, and we only get 243 gflops. Just like with the ATI hardware,
-this is caused by the low PCI-e bandwidth.
-With NVIDIA hardware and our data-intensive kernel, we do see significant
-performance gains by using asynchronous I/O. With synchronous I/O, we achieve only
-153 gflops. Therefore, the use of asynchronous I/O is essential.
+The kernel itself, without host-to-device communication, achieves 31\%
+of the theoretical peak performance. If we include communication, the
+performance drops to 26\% of the peak. Just like with the ATI
+hardware, this is caused by the low PCI-e bandwidth. With NVIDIA
+hardware and our data-intensive kernel, we do see significant
+performance gains by using asynchronous I/O.

-\subsection{Cell}
-
+\subsection{Cell/B.E.}
 The basic \mbox{Cell/B.E.} programming model is based on multi-threading:
 the PPE spawns threads that execute asynchronously on SPEs.
@@ -921,7 +898,7 @@ transfers~\cite{cell}.
 The \mbox{Cell/B.E.} can be programmed in C or C++, while using
 intrinsics to exploit vector parallelism.

-The large number of registers (128 times 4 floats) allows a big tile size of
+The large number of registers allows a large tile size of
 $4\times3$, leading to a lot of data reuse.
 We exploit the vector parallelism of the \mbox{Cell/B.E.} by computing
 the four polarization combinations in parallel. We found that this performs
@@ -931,16 +908,6 @@ The shuffle instruction is executed in the odd pipeline, while the
 arithmetic is executed in the even pipeline, allowing them to
 overlap.

-We identified a minor performance problem with the pipelines of the
-\mbox{Cell/B.E.} Regrettably, there is no (auto)increment instruction in the odd
-pipeline. Therefore, loop counters and address calculations have to
-be performed on the critical path, in the even pipeline. In the time
-it takes to increment a simple loop counter, four multiply-adds, or 8
-flops could have been performed. To circumvent this, we performed loop
-unrolling in our kernels. This solves the performance problem, but has
-the unwanted side effect that it uses local store memory, which is
-better used as data cache.
-
 A distinctive property of the architecture is that cache transfers
 are explicitly managed by the application, using DMA. This is unlike
 other architectures, where caches work transparently.
@@ -960,11 +927,9 @@ Although issuing explicit DMA commands complicates programming,
 for our application this is not problematic.

 Due to the high
-memory bandwidth and the ability to reuse data, we achieve 187
-gflops, including all memory I/O. This is 92\% of the peak
-performance on one chip. If we use both chips in the cell blade, the
-performance drops only with a small amount, and we still achieve
-91\% (373 gflops) of the peak performance. Even though the memory
+memory bandwidth and the ability to reuse data, we achieve 92\% of the peak
+performance on one chip. If we use both chips in the Cell blade, we still achieve
+91\%. Even though the memory
 bandwidth per operation of the \mbox{Cell/B.E.} is eight times lower than
 that of the BG/P, we still achieve excellent performance, thanks to
 the high data reuse factor.
@@ -979,13 +944,13 @@
 \begin{tabular}{l|l|l|l|l}
 Intel Core i7 920 & IBM Blue Gene/P & ATI 4870 & NVIDIA Tesla C1060 & STI Cell/B.E. \\ \hline
-+ well-known & + L2 prefetch unit works well & + largest number of cores & + random write access & + explicit cache \\
++ well-known & + L2 prefetch unit & + largest number of cores & + random write access & + explicit cache \\
 & + high memory bandwidth & + swizzling support & + Cuda is high-level & + random write access \\
 & & & & + shuffle capabilities \\
 & & & & + power efficiency \\
 & & & & \\
-- few registers & - everything double precision & - low PCI-e bandwidth & - low PCI-e bandwidth & - multiple parallelism levels \\
-- no fma & - expensive & - transfer slows down kernel & & - no increment in odd pipe \\
+- few registers & - double precision only & - low PCI-e bandwidth & - low PCI-e bandwidth & - multiple parallelism levels \\
+- no fma instruction & - expensive & - transfer slows down kernel & & - no increment in odd pipe \\
 - limited shuffling & & - no random write access & & \\
 & & - bad Brook+ performance & & \\
 & & - CAL is low-level & & \\
@@ -1008,9 +973,7 @@ to that of the BG/P. Although the theoretical peak performance of the
 performance is only slightly less. If both chips in the QS21 blade
 are used, the \mbox{Cell/B.E.} also has the highest absolute
 performance. For the GPUs, it is possible to use more than one chip as
-well. This can be done in the form of multiple PCI-e cards, or with
-two chips on a single card, as is done with the ATI 4870x2
-device. However, we found that this does not help, since the
+well, for instance with the ATI 4870x2 device. However, we found that this does not help, since the
 performance is already limited by the low PCI-e throughput, and the
 chips have to share this resource. The graph indeed shows that the
@@ -1024,7 +987,7 @@ results are applicable to signal processing applications in general.

-\section{Programmability}
+\section{Programmability of the platforms}

 The performance gap between assembly and a high-level programming
 language is quite different for the different platforms. It also
@@ -1034,11 +997,8 @@ etc., up to a level that the C code becomes almost as low-level as
 assembly code. The difference varies from only a few percent to a factor of 10.
 For the BG/P, the performance from compiled C++ code was far from
-sufficient. The assembly version hides load and instruction
-latencies, issues concurrent floating point, integer, and load/store
-instructions, and uses the L2 prefetch buffers in the most optimal
-way. The resulting code is approximately 10 times faster than C++
-code. For both the Cell/B.E. and the Intel core~i7, we found that
+sufficient. The assembly code is approximately 10 times faster.
+For both the Cell/B.E. and the Intel Core~i7, we found that
 high-level code in C or C++ in combination with the use of intrinsics
 to manually describe the SIMD parallelism yields acceptable
 performance compared to optimized assembly code. Thus, the programmer
@@ -1052,6 +1012,7 @@ found that the high-level Brook+ model does not achieve acceptable
 performance compared to hand-written CAL code. Manually written
 assembly is more than three times faster. Also, the Brook+
 documentation is insufficient.
+
 \section{Applying the techniques: a case study with the Intel Larrabee}

 Intel recently disclosed some details about the upcoming Larrabee processor,
@@ -1069,11 +1030,11 @@ One option is to operate on 16~samples with consecutive time stamps.
 A minor drawback is that the data must be ``horizontally'' added to
 integrate, but this can be done outside the main loop.
Another option is to operate on samples from 16~consecutive frequencies. -An advantage of this may be that the input is in the right order (i.e., -the 16~values can be read from consecutive memory locations) if a Poly-Phase -Filter precedes the correlator: the FFT outputs consecutive frequencies into -consecutive memory locations. -Both +%% An advantage of this may be that the input is in the right order (i.e., +%% the 16~values can be read from consecutive memory locations) if a Poly-Phase +%% Filter precedes the correlator: the FFT outputs consecutive frequencies into +%% consecutive memory locations. +%% Both Another option is to correlate samples from different receivers as illustrated by Figure~\ref{fig-correlation}. -- GitLab
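
As an illustration of the tiling idea the revised paper text describes (this is not code from the LOFAR correlator or from the patch itself): a minimal plain-C sketch of correlating one 2x2 tile of receiver pairs. The names (NR_RECEIVERS, correlate_tile_2x2), the single-polarization layout, and the sample ordering are assumptions made only for this sketch; the real kernels use SSE4, Cuda, or SPU intrinsics instead.

#include <complex.h>

#define NR_RECEIVERS 64   /* receiver count assumed in the paper text */

/* Correlate a 2x2 tile of receiver pairs over nr_times samples.
 * Per time step, four samples are loaded and each is reused for two of the
 * four baselines; the 4x4 example in the text reuses eight loaded samples
 * for sixteen correlations in exactly the same way. */
void correlate_tile_2x2(const float complex samples[][NR_RECEIVERS],
                        int nr_times, int rx_a, int rx_b,
                        float complex vis[2][2])
{
    float complex acc00 = 0, acc01 = 0, acc10 = 0, acc11 = 0;

    for (int t = 0; t < nr_times; t++) {
        /* these four values stay in registers ... */
        float complex a0 = samples[t][rx_a], a1 = samples[t][rx_a + 1];
        float complex b0 = samples[t][rx_b], b1 = samples[t][rx_b + 1];

        /* ... and are reused for all four baselines of the tile */
        acc00 += a0 * conjf(b0);
        acc01 += a0 * conjf(b1);
        acc10 += a1 * conjf(b0);
        acc11 += a1 * conjf(b1);
    }

    vis[0][0] = acc00; vis[0][1] = acc01;
    vis[1][0] = acc10; vis[1][1] = acc11;
}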
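The Cell/B.E. paragraphs describe overlapping the explicitly managed DMA transfers with computation. A sketch of that double-buffering pattern follows; dma_get, dma_wait, correlate_block, sample_block and the block size are hypothetical placeholders standing in for the real SPE mechanisms, not the actual intrinsics.

#include <complex.h>

typedef struct { float complex data[256][64]; } sample_block;  /* illustrative size */

/* Hypothetical stand-ins for the SPE's tagged DMA operations. */
void dma_get(sample_block *local, const sample_block *remote, int tag);
void dma_wait(int tag);
void correlate_block(const sample_block *block);

/* Double buffering: while block b is being correlated, block b+1 is already
 * being transferred into the other local-store buffer, so the transfer cost
 * is hidden behind the computation. */
void correlate_stream(const sample_block *samples_in_main_memory, int nr_blocks)
{
    static sample_block buf[2];

    dma_get(&buf[0], &samples_in_main_memory[0], 0);   /* prefetch first block */

    for (int b = 0; b < nr_blocks; b++) {
        int cur = b & 1, nxt = cur ^ 1;

        if (b + 1 < nr_blocks)                         /* start the next transfer */
            dma_get(&buf[nxt], &samples_in_main_memory[b + 1], nxt);

        dma_wait(cur);                                 /* current block has arrived */
        correlate_block(&buf[cur]);                    /* compute overlaps transfer */
    }
}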