Commit b76cc953 authored by Rob van Nieuwpoort

Bug 1198: restored longversion parts to fill the 10 pages

parent dab1ebeb
 We are very happy with the positive comments of the reviewers.
-In the full paper, we were able to address all issues that were identified by the reviewers.
+We were able to address all issues that were identified by the reviewers.
 Alg:
-code on the web
+new things:
+code open sources and on the web
+section on programmability (Section 7)
+???
 reviewer 1
 ----------
@@ -22,12 +26,16 @@ reviewer 1
 > referenced and that the authors even try to "reinvent". This shall be
 > improved for full acceptance
+By no means did we intend to claim that we developed new memory
+optimization techniques, nor did we try to reinvent them. Our aim was to simply
+describe techniques we used for optimizing algorithms on many core
+hardware, and we introduced the terminology as it is used in the GPGPU
+field (e.g. "coalescing"). We now clearly state this in the paper.
-By no means did we intend to claim that we developed new memory optimization techniques
-or try to "reinvent" them. We simply describe techniques we used for optimizing
-algorithms on many core hardware, and introduce the terminology as it is used in the GPGPU
-field (e.g. "coalescing"). However, we do agree that it is a very good idea
-to provide more context. We added a complete new section ....@@@x
+However, we certainly agree that it is a very good
+idea to provide more context on memory-related optimizations, and
+refer to the large body of research that was already done in this
+area. We added a complete new section, Section 5.2.1 about this.
 reviewer 2
......
@@ -788,16 +788,16 @@
 year = {2000}
 }
-@misc
-{
-Bruyn:02,
-title = {{Exploring the Universe with the Low Frequency Array, A Scientific Case}},
-author = {A.G. de Bruyn and others},
-note = {http://www.lofar.org/PDF/NL-CASE-1.0.pdf},
-month = {September},
-year = {2002}
+@article{lofar,
+author = "M.P. van Haarlem",
+title = "LOFAR: The Low Frequency Array",
+DOI= "10.1051/eas:2005169",
+note = {\url{http://dx.doi.org/10.1051/eas:2005169}},
+journal = "European Astronomical Society Publications Series",
+year = 2005,
+volume = 15,
+pages = "431-444",
 }
-author = {A.G. de Bruyn and R.P. Fender and J.M.E. Kuijpers and G.K. Miley and R. Ramachandran and H.J.A. R\"ottgering and B.W. Stappers and {M.A.M. van de} Weygaert and {M.P. van} Haarlem},
 @phdthesis
 {
......
@@ -270,7 +270,7 @@ The cost
 is dominated by the cost of computing and will follow Moore's law,
 becoming cheaper with time and allowing increasingly large telescopes
 to be built.
-\longversion{
 LOFAR will enable exciting new science cases. First, we expect to see
 the \emph{Epoch of Reionization\/} (EoR), the time that the first star
 galaxies and quasars were formed. Second, LOFAR offers a unique
@@ -284,9 +284,8 @@ switch focus to some galactic event. Fourth, \emph{Deep Extragalactic
 galaxies and study star-forming galaxies. Fifth, LOFAR will be
 capable of observing the so far unexplored radio waves emitted by
 \emph{cosmic magnetic fields}. For a more extensive description of
-the astronomical aspects of the LOFAR system, see De Bruyn
-et~al.~\cite{Bruyn:02}.
-}
+the astronomical aspects of the LOFAR system, see~\cite{lofar}.
 A global overview of the LOFAR processing is given in
 Figure~\ref{fig:lofar-overview}. The thickness of the lines indicates
 the size of the data streams. Initial processing is done in the
@@ -347,7 +346,6 @@ numbers: two polarizations, each with a real and an imaginary part.
 LOFAR uses an FX correlator: it first filters the different frequencies, and
 then correlates the signals. This is more efficient than an XF correlator for larger numbers of receivers.
-\longversion{
 Prior to correlation, the data that comes from
 the receivers must be reordered:
 each input carries the signals of many frequency bands from a single
@@ -357,7 +355,8 @@ The data reordering phase is outside the scope of this paper, but a correlator
 implementation cannot ignore this issue.
 The LOFAR Blue Gene/P correlator uses the fast 3D~torus for this purpose;
 other multi-core architectures need external switches.
-}
 The received signals from sky sources are so weak that the antennas
 mainly receive noise. To see if there is statistical coherence
 in the noise, simultaneous samples of each pair of receivers are correlated,
@@ -541,13 +540,11 @@ and is managed \emph{entirely by the application} with explicit DMA
 transfers to and from main memory. The LS can be considered the SPU's (explicit) L1 cache. The
 \mbox{Cell/B.E.} has a large number of registers: each SPU has 128,
 which are 128-bit (4 floats) wide.
-\longversion{
 The SPU can dispatch two
 instructions in each clock cycle using the two pipelines designated
 \emph{even} and \emph{odd}. Most of the arithmetic instructions
 execute on the even pipe, while most of the memory instructions
 execute on the odd pipe.
-}
 For the performance evaluation, we use a QS21 Cell blade with two
 \mbox{Cell/B.E.} processors.
 The 8 SPEs of a single chip in the
@@ -1043,13 +1040,11 @@ ratio significantly. We found that this optimization improved performance by a
 This optimization is a good example that shows that, on GPUs, it is important to optimize
 memory behavior, even at the cost of additional instructions and synchronization overhead.
-\longversion{
 We also investigated the use of the per-multiprocessor shared memory as an
 application-managed cache. Others report good results with this
 approach~\cite{gpu-cache}. However, we found that, for our
 application, the use of shared memory only led to performance
-degradation.
-}
+degradation compared to the use of the texture caches.
 Registers are a shared resource. Using fewer registers in a kernel
 allows the use of more concurrent threads, hiding load delays.
@@ -1155,7 +1150,6 @@ architectural strengths and weaknesses that we discussed.
 %@@@ larrabee / long vectors
-\longversion{
 \section{Programmability of the platforms}
 The performance gap between assembly and a high-level programming language
@@ -1180,7 +1174,6 @@ should be kept in registers. With ATI hardware, this is different. We
 found that the high-level Brook+ model does not achieve acceptable
 performance compared to hand-written CAL code. Manually written assembly
 is more than three times faster. Also, the Brook+ documentation is insufficient.
-}
 \longversion{
 \section{Applying the techniques: a case study with the Intel Larrabee}
......