diff --git a/doc/papers/2010/SPM/cover-letter.txt b/doc/papers/2010/SPM/cover-letter.txt
index f83384f97c96b61f04612eb27f0f57f976630ebf..30d83d8e4d67d214446ff11417068785c8603931 100644
--- a/doc/papers/2010/SPM/cover-letter.txt
+++ b/doc/papers/2010/SPM/cover-letter.txt
@@ -1,9 +1,13 @@
 We are very happy with the positive comments of the reviewers.
-In the full paper, we were able to address all issues that were identified by the reviewers.
+We were able to address all issues that were identified by the reviewers.
 Alg:
-code on the web
+new things:
+
+code open sources and on the web
+section on programmability (Section 7)
+???
 reviewer 1
 ----------
@@ -22,12 +26,16 @@ reviewer 1
 > referenced and that the authors even try to "reinvent". This shall be
 > improved for full acceptance
+By no means did we intend to claim that we developed new memory
+optimization techniques, not did we try to reinvent them. Our aim was to simply
+describe techniques we used for optimizing algorithms on many core
+hardware, and we introduced the terminilogy as it is used in the GPGPU
+field (e.g. "coalescing"). We now clearly state this in the paper.
-By no means did we intend to claim that we developed new memory optimization techniques
-or try to "reinvent" them. We simply descrite techniques we used for optimizing
-algorithms on many core hardware, and introduce the terminilogy as it is used in the GPGPU
-field (e.g. "coalescing"). However, we do agree that it is a very good idea
-to provide more context. We added a complete new section ....@@@x
+However, we certainly agree that it is a very good
+idea to provide more context on memory-related optimizations, and
+refer to the large body of research that was already done in this
+area. We added a complete new section, Section 5.2.1 about this.
 reviewer 2
diff --git a/doc/papers/2010/SPM/spm.bib b/doc/papers/2010/SPM/spm.bib
index 415dab328a59bc7bc3d8501a347a81859a53ed86..846f18044df8df9082f3ac57b143250d8a809ac2 100644
--- a/doc/papers/2010/SPM/spm.bib
+++ b/doc/papers/2010/SPM/spm.bib
@@ -788,16 +788,16 @@
 year = {2000}
 }
-@misc
-{
- Bruyn:02,
- title = {{Exploring the Universe with the Low Frequency Array, A Scientific Case}},
- author = {A.G. de Bruyn and others},
- note = {http://www.lofar.org/PDF/NL-CASE-1.0.pdf},
- month = {September},
- year = {2002}
+@article{lofar,
+ author = "M.P. van Haarlem",
+ title = "LOFAR: The Low Frequency Array",
+ DOI= "10.1051/eas:2005169",
+ note = {\url{http://dx.doi.org/10.1051/eas:2005169}},
+ journal = "European Astronomical Society Publications Series",
+ year = 2005,
+ volume = 15,
+ pages = "431-444",
 }
- author = {A.G. de Bruyn and R.P. Fender and J.M.E. Kuijpers and G.K. Miley and R. Ramachandran and H.J.A. R\"ottgering and B.W. Stappers and {M.A.M. van de} Weygaert and {M.P. van} Haarlem},
 @phdthesis
 {
diff --git a/doc/papers/2010/SPM/spm.tex b/doc/papers/2010/SPM/spm.tex
index 25bf179b7c45e89e4dd0019acd0f03802fcdabe7..40778c9c2ccaae5bdb36896aec51cd9fb45a35e9 100644
--- a/doc/papers/2010/SPM/spm.tex
+++ b/doc/papers/2010/SPM/spm.tex
@@ -270,7 +270,7 @@ The cost is dominated by the cost of computing and will follow
 Moore's law, becoming cheaper with time and allowing increasingly large
 telescopes to be built.
-\longversion{
+
 LOFAR will enable exciting new science cases. First, we expect to see
 the \emph{Epoch of Reionization\/} (EoR), the time that the first star
 galaxies and quasars were formed. Second, LOFAR offers a unique
@@ -284,9 +284,8 @@ switch focus to some galactic event. Fourth, \emph{Deep Extragalactic
 galaxies and study star-forming galaxies. Fifth, LOFAR will be capable
 of observing the so far unexplored radio waves emitted by \emph{cosmic
 magnetic fields}.
 For a more extensive description of
-the astronomical aspects of the LOFAR system, see De Bruyn
-et.~al.~\cite{Bruyn:02}.
-}
+the astronomical aspects of the LOFAR system, see~\cite{lofar}.
+
 A global overview of the LOFAR processing is given in
 Figure~\ref{fig:lofar-overview}. The thickness of the lines indicates the
 size of the data streams. Initial processing is done in the
@@ -347,7 +346,6 @@ numbers: two polarizations, each with a real and an imaginary part.
 LOFAR uses an FX correlator: it first filters the different frequencies,
 and then correlates the signals. This is more efficient than an XF
 correlator for larger numbers of receivers.
-\longversion{
 Prior to correlation, the data that comes from the receivers must be
 reordered: each input carries the signals of many frequency bands from a single
@@ -357,7 +355,8 @@ The data reordering phase is outside the scope of this paper, but a
 correlator implementation cannot ignore this issue. The LOFAR Blue Gene/P
 correlator uses the fast 3D~torus for this purpose; other multi-core
 architectures need external switches.
-}
+
+
 The received signals from sky sources are so weak, that the antennas
 mainly receive noise. To see if there is statistical coherence in the
 noise, simultaneous samples of each pair of receivers are correlated,
@@ -541,13 +540,11 @@ and is managed \emph{entirely by the application} with explicit DMA
 transfers to and from main memory. The LS can be considered the SPU's
 (explicit) L1 cache. The \mbox{Cell/B.E.} has a large number of
 registers: each SPU has 128, which are 128-bit (4 floats) wide.
-\longversion{
 The SPU can dispatch two instructions in each clock cycle using the two
 pipelines designated \emph{even} and \emph{odd}. Most of the arithmetic
 instructions execute on the even pipe, while most of the memory
 instructions execute on the odd pipe.
-}
 For the performance evaluation, we use a QS21 Cell blade with two
 \mbox{Cell/B.E.} processors.
 The 8 SPEs of a single chip in the
@@ -1043,13 +1040,11 @@ ratio significantly. We found that this optimization improved performance by a
 This optimization is a good example that shows that, on GPUs, it is
 important to optimize memory behavior, even at the cost of additional
 instructions and synchronization overhead.
-\longversion{
 We also investigated the use of the per-multiprocessor shared memory
 as an application-managed cache. Others report good results with this
 approach~\cite{gpu-cache}. However, we found that, for our
 application, the use of shared memory only led to performance
-degradation.
-}
+degradation compared to the use of the texture caches.
 Registers are a shared resource. Using fewer registers in a kernel
 allows the use of more concurrent threads, hiding load delays.
@@ -1155,7 +1150,6 @@ architectural strengths and weaknesses that we discussed.
 %@@@ larrabee / lange vectoren
-\longversion{
 \section{Programmability of the platforms}
 The performance gap between assembly and a high-level programming language
@@ -1180,7 +1174,6 @@ should be kept in registers. With ATI hardware, this is different. We
 found that the high-level Brook+ model does not achieve acceptable
 performance compared to hand-written CAL code. Manually written assembly
 is more than three times faster. Also, the Brook+ documentation is insufficient.
-}
 \longversion{
 \section{Applying the techniques: a case study with the Intel Larrabee}