From efbb7c16fe2256e893a0c0bb43592da945acd848 Mon Sep 17 00:00:00 2001
From: Eric Kooistra <kooistra@astron.nl>
Date: Thu, 15 Oct 2020 20:07:05 +0200
Subject: [PATCH] Weekly commit.

---
 .../doc/prestudy/desp_howtools_erko.txt       |  52 ++-
 .../lofar2/doc/prestudy/station2_sdp_dsp.txt  |  70 ++-
 .../prestudy/station2_sdp_firmware_design.txt |  15 +-
 .../prestudy/station2_sdp_hdl_components.txt  |  32 +-
 .../lofar2/doc/prestudy/station2_sdp_icd.txt  |   1 +
 .../lofar2/doc/prestudy/station2_sdp_ring.txt | 433 +++---------------
 .../lofar2/doc/prestudy/station2_sdp_srs.txt  |   4 +-
 7 files changed, 233 insertions(+), 374 deletions(-)

diff --git a/applications/lofar2/doc/prestudy/desp_howtools_erko.txt b/applications/lofar2/doc/prestudy/desp_howtools_erko.txt
index 53687f6236..7bdc861dd8 100755
--- a/applications/lofar2/doc/prestudy/desp_howtools_erko.txt
+++ b/applications/lofar2/doc/prestudy/desp_howtools_erko.txt
@@ -12,6 +12,7 @@
 * Quartus Qsys IP files in GIT
 * Quartus version
 * Linux
+* ICT diensten
@@ -340,7 +341,11 @@ for me it is an ok workaround,
 * Markdown
 *******************************************************************************
-Use e.g. Linux 'retext' as markdown editor and previewer. Markdown is not well
+Use e.g.:
+- Linux 'retext' as markdown editor and previewer.
+- https://typora.io
+
+Markdown is not well
 defined, so it is not always sure that the text will appear as expected in each viewer. For
 example retext and gitlab viewer differ.
 Therefore it is important to keep the markdown simple and to accept that readable text is good
@@ -367,6 +372,10 @@ __bold__
 `boxed`
 ~~strike through~~
+```vhdl
+Text in ascii VHDL style for GitLab
+```
+
 Block quotes (alinea with an indent bar):
 > Block text will wrap
@@ -656,6 +665,16 @@ Quartus version meeting minutes 13 may 2020 (RW, LH JH, EK):
 https://linuxize.com/
+# Linux update via
+# - system updates available icon and notifications icon in toolbar
+# - or via command line:
+> uname -a # linux info
+> sudo -s # become root
+> apt-get upgrade
+> apt-get dist-upgrade
+> apt autoremove
+
+
 dop466 = SSD
 dop466_0 = HDD
@@ -693,3 +712,34 @@ ls -l filename # shows current user,group owners of the 'filena
 sudo chgrp software filename # change group of 'filename' to 'software'
 sudo chgrp -R software dirname # recursively change group of 'dirname' to 'software'
 #chown # change user,group
+
+
+*******************************************************************************
+* ICT diensten
+*******************************************************************************
+
+Self Service Password Reset
+
+With Self Service Password Reset (SSPR) users can reset their own password for various LDAP
+services without involving ICT, for example for the network drives (H and I), intranet, VPN,
+Confluence, Jira and Surfmarkt.
+By answering challenge questions a user can confirm who he/she is before safely resetting the
+password.
+Link: https://sspr.astron.nl
+
+SURFfilesender
+SURFfilesender is a web-based application that lets authenticated users securely and easily send
+arbitrarily large files to other users. Users without an account can be sent a guest voucher by
+a verified user. SURFfilesender was developed according to the requirements and wishes of
+education and research.
+Link: https://filesender.surf.nl
+
+edu.nl: privacy-friendly URL shortener
+edu.nl is the URL shortener for education and research. Secure because users log in with
+SURFconext. Privacy-friendly because edu.nl does not store personal data of users and does not
+track visitors of links. edu.nl is free of charge for all institutions affiliated with SURF.
+Link: https://edu.nl
+
+An extensive description of these services (Home » Diensten » ICT » Manuals and Documents)
+can be found on the intranet
+https://intranet.astron.nl/diensten/ict/manuals-and-documents/manuals-and-documents
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
index 8f0b2dfce9..66f539a79c 100755
--- a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
@@ -29,6 +29,20 @@ Fsub
 - use MM master mux to select between MM access and UDP offload, when UDP offload is enabled then do not do MM access.
+Support for oversampled subband filterbank
+The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are
+32/28 = 1.143, 32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28.
+
+
+Processing capacity per subband period:
+Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled
+subbands it will run at R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz.
+This means that the processing has N_clk = N_fft = 1024 clock cycles available per subband
+period T_sub, independent of R_os. In this way if the processing for the critically sampled
+subbands fits within N_clk = N_fft = 1024 clock cycles, then it will also fit for the
+oversampled subbands.
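The statement that N_clk stays at 1024 for any R_os can be checked with a short calculation. The sketch below is only a hypothetical illustration and not part of the firmware. It assumes, as the text states, that both the processing clock and the subband rate scale with R_os; the names F_CLK_CRIT_HZ, processing_clock_hz and clock_cycles_per_subband_period are made up for this example.

```python
from fractions import Fraction

F_CLK_CRIT_HZ = 200_000_000  # processing clock for the critically sampled filterbank
N_FFT = 1024                 # FFT size, so f_sub = 200 MHz / 1024 = 195312.5 Hz

# Oversampling ratios listed in the text, kept exact as fractions.
R_OS_OPTIONS = [Fraction(32, d) for d in (28, 27, 26, 25, 24)]


def processing_clock_hz(r_os):
    """Assumed processing clock: R_os times the critically sampled clock."""
    return F_CLK_CRIT_HZ * r_os


def clock_cycles_per_subband_period(r_os):
    """Clock cycles available per subband period T_sub for a given R_os."""
    f_sub = Fraction(F_CLK_CRIT_HZ, N_FFT) * r_os  # subband rate also scales with R_os
    return processing_clock_hz(r_os) / f_sub


for r_os in R_OS_OPTIONS:
    f_clk_mhz = float(processing_clock_hz(r_os)) / 1e6
    n_clk = clock_cycles_per_subband_period(r_os)
    print(f"R_os = {float(r_os):.3f}: f_clk = {f_clk_mhz:.1f} MHz, N_clk = {n_clk}")
```

Because R_os cancels in the ratio f_clk / f_sub, the cycle budget per subband period does not depend on the oversampling factor, which is the reason a design that fits for critically sampled subbands also fits for oversampled subbands.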
+
+
 *******************************************************************************
 * Beamformer
 *******************************************************************************
@@ -68,7 +82,7 @@ M&C:
     |wx  0|   |x|   |wx*x|
     | 0 wy| * |y| = |wy*y|
-  . Wsing cx = wx and cy = wy and wx /= wy allows making two independent unpolarized beams using all antenne elements:
+  . Using cx = wx and cy = wy and wx /= wy allows making two independent unpolarized beams using all antenna elements:
     |wx  0|   |1 1|   |x|   |wx wx|   |x|   |wx * (x + y)|
     | 0 wy| * |1 1| * |y| = |wy wy| * |y| = |wy * (x + y)|
@@ -76,6 +90,60 @@ M&C:
 The (x+y) could be implemented as first (x+y) and then *w, or as first weight and then add.
+W_beamlet_sum
+LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will
+only support 8 bit. Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit
+beamlet mode, which is sufficient to detect overflow. Using W_beamlet_sum = 18 bit also fits the
+input data width of the FPGA hard core multipliers in the BST. Given that the SDP signal input
+level is 4 bit the beamformer could round 2 LSbit to effectively achieve 20 bit dynamic range,
+even for S = 1 signal input. However the same effect can also be achieved by reducing the beamlet
+weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically
+sampled beamlet data and the oversampled beamlet data, to avoid differences in the design.
+
+In the dp_clk domain:
+Using 488 yields 512 / 488 = 4.9 % margin
+Using 496 yields 512 / 496 = 3.2 % margin
+Using 512 has no margin, so requires higher dp_clk rate to be able to insert headers.
+
+On the Ring lane:
+Using S_sub_bf = 488 yields 6.8625 Gbps, so 1 - 6.8625 / 7.8125 = 12.2% margin
+Using S_sub_bf = 496 yields 6.975 Gbps, so 1 - 6.975 / 7.8125 = 10.7% margin
+Using S_sub_bf = 512 yields 7.2 Gbps, so 1 - 7.2 / 7.8125 = 7.8% margin
+
+Design decision:
+  Use dp_clk = 200 MHz, so do not overclock to support S_sub_bf = 512. It may be feasible to
+  support S_sub_bf = 496, but assume 488 because that is required.
+
+Design decision:
+  Use W_beamlet_sum = 18 bit for both critically sampled beamlets and oversampled beamlets.
+  Using W_beamlet_sum = 18 bit fits on one 10GbE lane on the ring, fits the input data width
+  of the FPGA hard core multipliers and also provides sufficient dynamic range to scale the final
+  beamlet sum to W_beamlet = 8 bit for output.
+
+
+W_beamlet_sum = 18 bit
+The beamlet sum that is transported across the ring needs to fit on a 10GbE lane. For one beamset of
+S_sub_bf = 488 beamlets the data rate is N_pol * S_sub_bf * f_sub * N_complex *
+W_beamlet_sum = 2 * 488 * 195312.5 * 2 * 18 = 6.8625 Gbps. Using L_lane = 7.8125 Gbps this leaves
+about 1 - 6.8625 / 7.8125 = 12% margin for packet overhead, which is sufficient.
+
+What is the beamlet packet size?
+The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At
+the end PN the final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate
+beamlet sum has W_beamlet_sum = 18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976
+beamlets per packet. The payload size is N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte =
+2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet size is 60 + 4392 = 4452 octets. With
+f_sub = 195312.5 Hz the data rate is 4452 * 195312.5 * 8 = 6.95625 Gbps < L_lane = 7.8125 Gbps, so
+it fits on a 10GbE lane.
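The lane budget above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the constants quoted in the text (N_pol = 2, N_complex = 2, W_beamlet_sum = 18, f_sub = 195312.5 Hz, L_lane = 7.8125 Gbps and a packet overhead of 60 octets); the function names are hypothetical and only serve this illustration.

```python
N_POL = 2              # polarizations
N_COMPLEX = 2          # real and imaginary part
W_BEAMLET_SUM = 18     # bits per beamlet sum component
W_BYTE = 8             # bits per octet
F_SUB_HZ = 195312.5    # subband (packet) rate
L_LANE_BPS = 7.8125e9  # lane capacity reserved for critically sampled data
P_OVERHEAD = 60        # packet overhead in octets, as used in the text


def payload_rate_bps(s_sub_bf):
    """Beamlet sum payload data rate on the ring lane in bit/s."""
    return N_POL * s_sub_bf * F_SUB_HZ * N_COMPLEX * W_BEAMLET_SUM


def lane_margin(s_sub_bf):
    """Fraction of the lane capacity that is left for packet overhead."""
    return 1 - payload_rate_bps(s_sub_bf) / L_LANE_BPS


def packet_size_octets(s_sub_bf):
    """Effective packet size: payload octets plus packet overhead."""
    payload = N_POL * s_sub_bf * N_COMPLEX * W_BEAMLET_SUM // W_BYTE
    return payload + P_OVERHEAD


def packet_rate_bps(s_sub_bf):
    """Data rate on the lane including packet overhead in bit/s."""
    return packet_size_octets(s_sub_bf) * W_BYTE * F_SUB_HZ


for s_sub_bf in (488, 496, 512):
    print(f"S_sub_bf = {s_sub_bf}: payload {payload_rate_bps(s_sub_bf) / 1e9:.4f} Gbps, "
          f"margin {100 * lane_margin(s_sub_bf):.1f} %, "
          f"packet {packet_size_octets(s_sub_bf)} octets, "
          f"with overhead {packet_rate_bps(s_sub_bf) / 1e9:.5f} Gbps")
```

Keeping the payload rate and the rate including overhead as separate functions makes it easy to see how much of the lane margin the 60 octet overhead consumes for a given S_sub_bf.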
+
+RW:
+The output of the beamlet subband select is copied inside the local BF = sdp_beamformer_local.vhd, so it goes to an X polarization branch in the local BF and the same data goes to the Y polarization branch. The BF weight values for sdp_bf_weights.vhd then determine that one branch makes X pol beamlets and that the other branch makes Y pol beamlets. N_pol = 2, with pol 0 = X and pol 1 = Y. The subbands and beamlets are complex (re/im); the fact that they are complex has nothing to do with polarization.
+
+Figure 3.4: show sdp_bf_weights as one flat 12 inputs block with 2 x 6 inputs connected as copies. Section 4.1 [N_pol] index is ok.
+
+Each antenna has an X pol and a Y pol. The BF makes X pol beamlets (one branch) and Y pol beamlets (other branch). A special feature is that with the BF weights both the X and the Y pol antennas can contribute to the X pol beamlet and to the Y pol beamlet. This would make cross-polarization corrections possible.
+
+
 Text saved in case we do need time actived BF weigths using the BSN scheduler:
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt b/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt
index 767e295f74..bd68a9401f 100755
--- a/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt
@@ -243,4 +243,17 @@ WP 5 SDP plan: --> https://support.astron.nl/confluence/display/STAT/WP-5+SDP
 Other:
   . tools/oneclick/doc/desp_firmware_dag_erko.txt
   . tools/oneclick/doc/desp_firmware_overview.txt
-  . desp_howtools_erko.txt
\ No newline at end of file
+  . desp_howtools_erko.txt
+
+
+*******************************************************************************
+* Design decision document
+*******************************************************************************
+
+Structure:
+- section 2.1 What is needed, what do we have now, boundary conditions
+- section 3.1 What is the option A design
+- section 3.2 What is the option B design
+- ...
+- section 3.3 Comparison of the options A, B, ...
+
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
index e44ee2e19a..ba8409be52 100755
--- a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
@@ -26,12 +26,31 @@
     . mon_bsn_first = BSN at first Rx sync --> not useful
     . mon_bsn_first_cycle_cnt = latency at first Rx sync --> should use every Rx sync like on RSP
-  ==> Reuse dp_bsn_monitor with improvements:
-    . Monitor the packets per sync interval using Rx sync. This is more precise then using the PPS
-      sync. The Rx sync based values are only valid if mon_sync_timeout = 0.
-    . Remove mon_bsn_first and mon_bsn_first_cycle_cnt.
-    . Add mon_latency, use PPS sync like in RSP to measure the latency between PPS sync and Rx
-      sync in number of clock cycles.
+  ==> Reuse dp_bsn_monitor_v2 with improvements:
+    . Monitor the packets per sync interval using Rx sync. This is more precise than using the PPS sync, because
+      with the Rx sync the number of packets does not depend on latency fluctuations. The packet with the Rx sync
+      can get lost, but this can be detected by Rx sync timeout (= mon_sync_timeout). Therefore the Rx sync based
+      values are only valid if mon_sync_timeout = 0.
+    . Monitor the channel field at the Rx sync. The channel field identifies the source rn index and the destination
+      rn index of the packet.
+      --> Add mon_channel
+    . Monitor the hop latency of the packets (= mon_latency) with respect to the PPS sync like in RSP. The
+      hop latency is measured by counting the number of clock cycles between PPS sync and Rx sync. Hence the hop
+      latency is measured for the first packet in the sync interval. This provides sufficient information on the
+      hop latency of all packets in the sync interval.
+      --> Add mon_latency.
+    . Remove mon_bsn_first and mon_bsn_first_cycle_cnt (so remove IN sync_in).
+    . Use a dp_bsn_monitor_v2 per channel source rn index
+      * One for the local channel
+      * Use a dp_bsn_monitor_v2 on Rx per channel source rn index, there are K channels/lane:
+        - combiner scheme : K = 1, for Rx from previous node
+        - end cast scheme : K <= N-1, for Rx from up to K nodes on the ring
+        - multi cast scheme : K <= N-1, for Rx from up to K nodes on the ring
+      * Use a dp_bsn_monitor_v2 on Tx per channel, there are K channels/lane:
+        - combiner scheme : K = 1, for Tx to next node
+        - end cast scheme : K <= N-1, for Tx to up to K nodes on the ring (transit packets and local packet)
+        - multi cast scheme : K <= N-1, for Tx to up to K nodes on the ring (transit packets and local packet)
+
@@ -68,6 +87,7 @@ Therefore the application payload should also have a CRC to ensure that no false positive CRC will occur during the life time of LOFAR 2.0.
+  UniBoard2 BER < 1 * 10^-14 for all 10G transceivers
   https://www.worldscientific.com/doi/10.1142/S225117171950003X
 Design decisions:
 - Use CHAN (32b), Sync & BSN (64b), DATA (>= 1 b), ERR (32b), CRC (32b) to transport data between
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_icd.txt b/applications/lofar2/doc/prestudy/station2_sdp_icd.txt
index 0716805b5e..bbae5cec5c 100755
--- a/applications/lofar2/doc/prestudy/station2_sdp_icd.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_icd.txt
@@ -699,6 +699,7 @@ Enianess:
 The Nios II architecture uses little-endian byte ordering. Words and halfwords are stored inmemory with the more-significant bytes at higher addresses.
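The little-endian byte ordering described in the context line above can be illustrated on a host PC. This is a small sketch using the standard Python struct module; it only shows the byte order of a 32 bit word and does not model the Nios II or the actual register map, and the helper name to_little_endian_bytes is hypothetical.

```python
import struct


def to_little_endian_bytes(word):
    """Return the 4 bytes of a 32 bit word in little-endian memory order."""
    return struct.pack("<I", word)  # '<' = little-endian, 'I' = unsigned 32 bit


word = 0x11223344
memory = to_little_endian_bytes(word)

# The least-significant byte 0x44 is at the lowest address (index 0) and the
# most-significant byte 0x11 is at the highest address (index 3).
for address, byte in enumerate(memory):
    print(f"address +{address}: 0x{byte:02X}")

# Reading the bytes back with the same byte order restores the original word.
restored = struct.unpack("<I", memory)[0]
```

A host side tool that exchanges words with a little-endian processor needs to use the same byte order when packing and unpacking, otherwise the bytes of each word appear reversed.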
+ ################################################################################################### # L3 ICD 11423 SDPTR - SDPFW diff --git a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt index 99ea252384..bd024c59b9 100755 --- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt @@ -1,181 +1,11 @@ Detailed design: RING -******************************************************************************* -* Data rate -******************************************************************************* - -Support for oversampled subband filterbank -The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are -32/28 = 1.142, 32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28. - -Processing capacity per subband period: -Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled -subbands it will run at R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz. -This means that the processing has N_clk = N_fft = 1024 clock cycles avaiable per subband -period T_sub, independent of R_os. In this way if the processing for the critically sampled -subbands fits within N_clk = N_fft = 1024 clock cycles, then it will also fit for the -oversampled subbands. - -IO capacity per 10GbE lane: -The IO data rate on the ring increases with the oversampling factor R_os. For oversampled data -the ring 10GbE has the full 10 Gbps capacity and for critically sampled data the effective -ring capacity per lane becomes L_lane = 10G / R_os = 10G / 1.28 = 7.8125 Gbps. The aim is to -be able to replace the critically sampled filterbank by an oversampled filterbank without -having to change other parts in the design. Therefore assume that the ring capacity for the -critically sampled data is restricted to L_lane < 7.8125 Gbps. 
- -Note: -The alternative to use full ring capacity for critically sampled data and then support less -(S_sub_bf / R_os = 488 / 1.28 = 381, so almost 30% less) beamlets for oversampled data is not -compliant with the requirement of S_sub_bf = 488. - -Design descision: Support S_sub_bf = 488 also for maximum R_os = 1.28. - - -W_beamlet_sum -LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will -only support 8 bit. Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit -beamlet mode, which is sufficient to detect overflow. Using W_beamlet_sum = 18 bit also fits the -input data width of the FPGA hard core multipliers in the BST. Given that the SDP signal input -level is 4 bit the beamformer could round 2 LSbit to effectively achieve 20 bit dynamic range, -even for S = 1 signal input. However the same effect can also be achieved by reducing the beamlet -weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically -sampled beamlet data and the oversampled beamlet data, to avoid differences in the design. The -beamlet sum that is transported across the ring needs to fit on a 10GbE lane. With S_sub_bf = 488 -the data rate for one full band station beam is N_pol * S_sub_bf * f_sub * N_complex * -W_beamlet_sum = 2 * 488 * 195312.5 * 2 * 18 = 6.8625 Gbps. Using L_lane = 7.8125 Gbps this leaves -about 1 - 6.8625 / 7.8125 = 12% margin for packet overhead, which is sufficient. - -On the lane: -Using s_sub_bf = 488 yields 6.8625 Gbps, so 1 - 6.8625 / 7.8125 = 12.1% margin -Using s_sub_bf = 496 yields 6.975 Gbps, so 1 - 6.975 / 7.8125 = 10.7% margin -Using s_sub_bf = 512 yields 7.2 Gbps, so 1 - 7.2 / 7.8125 = 7.8% margin - -In the dp_clk domain: -Using 488 yields 512 / 488 = 4.9 % margin -Using 496 yields 512 / 496 = 3.2 % margin -Using 512 has no margin, so requires higher dp_clk rate to be able to insert headers. 
- -Design decsision: - Use dp_clk = 200 MHz, so do not overclock to support S_sub_bf = 512. It may be feasible to - support S_sub_bf = 496, but assume 488 because that is required. - -Design descision: - Use W_beamlet_sum = 18 bit for both critically sampled beamlet and oversampled beamlets. - Using W_beamlet_sum = 18 bit fits the on one 10GbE lane on the ring, fits the input data width - of the FPGA hard core multipliers and also provides sufficent dynamic range to scale the final - beamlet sum to W_beamlet = 8 bit for output. - - - ******************************************************************************* * Ring links: ******************************************************************************* -OSI 1 Phyisical layer: Transceivers - -OSI 2 Data link layer: -Use Ethernet per transceiver link.The Ethernet MAC provides link establishment, so it uses a full -duplex transceiver. The Ethernet packet header contains destination MAC address, source MAC -address and Ethernet type. The Ethernet packet tail contains a CRC. The CRC provides data error -detection. There is no need to use UDP/IP and ARP, because the links in the ring are point to -point and will not be used in a network. The Ethernet fields can be used as: - -- Destination MAC = destination PN index -- Source MAC = source PN index -- Ethernet type = packet type - -Design decision: Use Ethernet for the ring transceiver links - - -Use 10GbE or 40GbE: -From the low-latency Ethernet core user guides it follows that the Ethernet core with statistics -registers use: - -- 10GbE core : 4300 FF, 4 M9K -- 40GbE core : 21200 FF, 13 M20K - -The synhesis fitter results from Apertif BF and XC show that the tech_eth_10g takes about 5500 FF -and 4 (BF) or 7 (XC) M9K.The BF MAC has no statistics, the XC MAC does have statistics. -Hence the 40GbE core is about a factor 4 larger than the 10GbE core, so from a resource usage point -of view it is does not matter whether we use 4 x 10GbE or 1 x 40GbE. 
The advantage of 40GbE is -that it can fit data rates > 10Gbps per data type stream. The advantage of using 10GbE is that we -can use one link per data type stream and thereby avoid having to multiplex different data streams -onto the same 40GbE link. However some multiplexing of local packets and remote transit packets can -also be needed. UniBoard2 has been tested with 10GbE but not yet with 40GbE. -The Arria10 on UniBoard2 has 1708800 FF so 1708800 / 182400 = 9.3 times more than the Stratix IV -on UniBoard1. On UniBoard2 one 10GbE interface uses maximum about 5500 / 1708800 = 0.32% of the -FF and maximum about 7 / 2713 = 0.25% of the block RAM. In total there will be 4 x 10GbE for the -intra board ring, 4 x 10GbE for the inter board ring and 1 x 10GbE for external IO, so these will -take about 3% of the FF and block RAM resources. -The packet rate is f_sub = 195312.5 Hz. At 10GbE this means that the maximum packet size is -10e9/195312.5 = 6400 octets. For oversampled subbands the maximum packet size drops to about -6400 / 1.25 = 5120 octets. If the minimum packet size is e.g. 4000 octets, then at 10GbE this -means that the link cannot be fully used, whereas at 40GbE multiple packets will still fit. The -maximum packet size for 10GbE also depends on the number of packets on the ring: -. With one packet on the ring the maximum packet size for 10GbE is 5120 (R_os = 1.25) octets, -. With N = 16 nodes and all nodes sending to the same end node the maximum packet size is 5120/16 - = 320 (R_os = 1.25) octets, -. If the packet only needs to travel N/2 nodes then the maximum packet size is 5120/8 = 640 - (R_os = 1.25) octets. - -Design descision: Assume the ring will use 4 x 10GbE, because it is known technology and suitable. - - -Internally in the FPGA the 10GbE data on the ring interface is available as 64 bit data at -156.25 MHz (64 * 156.25M = 10G). 
- - -Ring application Ethernet packet types: -The ring is used for the following application packet types: - -- 0x10FB for beamlets, -- 0x10FC for crosslets, -- 0x10FD for subband offload, -- 0x10FE for transient buffer read out - -The packet type information can be transported via the Ethernet type field or via an UDP port -number. If each lane is only used for one kind of packet type, then the packet type is only used -for information, because the PN already knows the packet type. The packet type value is based -on packet types that were defined in RSP, where 0x10FA was used to identify M&C data (0x10FA ~= -LOFAR) and the other type values just increment the 0x10FA value. - -Design decision: Transport application packet type via Ethernet type field for information - - - -OSI 3 Network layer: Use ring - -Wormhole routing (or cut-through routing) or store-and-forward routing: -With worm hole routing a received packet or a received and modified packet is already -transmitted, while the tail of the packet is still being received. The advantage of wormhole -routing is that it minimizes the latency along the ring and therefore also local buffering to -align between local and remote data. The disadvantage of wormhole routing is that a CRC error -on the received packet needs to be propagated by forcing the CRC of the transmitted packet to be -wrong. This implies that all subsequent hops will show this CRC error. For link diagnoses this -is confusing, because the subsequent links did not cause the CRC error. With store-and-forward -routing a packet is first received entirely before it is passed on for transmit. This allows to -discard a received packet with a CRC error, but does increase the latency on the ring. Packets -with a CRC error cannot be allowed to enter the processing in the node, because any bit in the -packet may be corrupted, especially in the packet header, so no meaningfull processing is -possible. 
- -Design descision: - For LOFAR 2.0 choose to use store-and-forward, because it allows discarding packets with CRC - errors when they occur and because there is sufficient internal block RAM to buffer the local - data for the worst case ring latency. - - -Only accept correct packets: -Discard all packets that have a CRC error. This also prevents that packets of wrong length enter -the internal processing. The Ethernet CRC error is 32 bit, so it is very unlikely that packet with -errors still has a correct CRC. With wormhole routing it was necessary to limit or extend a packet -to a known fixed length, because also packets with CRC error are passed on. With store-and-forward -routing the CRC provides sufficient protection to ensure that only correct packets enter the -application. - - +DONE: Ring latency: The RSP boards use wormhole routing on the ring. The latency of 1 hop between RSP boards is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at 10Gbps is about 1.2 us and a @@ -213,75 +43,15 @@ The dominant latencies in t_hop are t_fill and t_store. If t_hop < T_sub, then e require one block buffering of the local input to be able to align it to the remote input. The total buffering for the local input is then (N-1)*P_packet. +The latency on the ring is about 1 packet per transit hop, due to the store-and-forward. The +first hop has negligible latency. Hence with H hops the local data buffer size needs to be +(H-1) * local data size. -OSI 4 Transport layer: Use UDP/IP/ETH or only ETH on the ring: -We already have a UDP offload component that supports DP/UDP/IP/ETH, but a similar component that -only supports DP/ETH is easily derived from it. With an UDP the LOFAR packet type information can -be transported via the UDP port field. Using UDP/IP makes it easier to send the data to a PC for -monitoring purposes, however it is also possible to sniff raw Ethernet packets on a PC. Using a -PC to verify the ring allows capturing large amounts of data. 
On an FPGA we can use a data buffer -to sniff the packets, but only a few. The extra overhead of UDP = 8 octets and IP = 20, so 28 -octets in total. The disadvantage of using UDP/IP is that it adds some extra traffic overhead and -uses some extra logic resources, but that could be acceptable. The disadvantage of verifying the -ring using a PC are: - -- between FPGAs on the same UniBoard the ring can only be observed on the FPGA -- the ring will only connect FPGAs in the application, so using a PC is a side track that as such - may cause extra work. - -Using UDP/IP does not make it possible to replace the ring by a switch without modifications, so -changing from a ring based design to a switch based design will still imply a redesign of the -data transport scheme. - -Design decision: - Use raw ETH and verification on FPGA, because that fits the ring (especially between FPGAs on - UniBoard2) and avoids the extra overhead of UDP/IP. - - -Ring application header DP/ETH: -The packet payload needs to have an application header to carry the timestamp and a stream -identifier. This information can be tranported via the DP packet header which has a BSN field and -a channel field. The BSN is the timestamp. The channel field can carry the source PN index and -destination PN index. These PN indices are also available in the ETH source and destination MAC -addresses of ETH encoded packets, but they also need to be available in ETH decoded packets. In -ETH encoded packets the destination MAC address allow direct pass on of transit packets on the -ring, without having to ETH decode them. In ETH decoded packets the BSN and channel fields can be -passed along inside the encoded DP packet or in parallel with the decoded DP packet application -data. The channel information can be used to process the remote packets in parallel e.g. per -source PN index. The channel information can also provide flagging information, to e.g. identify -filler packets. - -Design decision: - Use DP/ETH. 
Together the CP CRC and ETH CRC ensure that for the lifetime of LOFAR2.0 packets - with correct CRC will not have false positives. Use a bit in the channel field to indicate - filler packets. - - -What is the DP/ETH packet overhead? - -- The ETH packet overhead consists of: - . Add 8 octets (c_network_eth_preamble_len) for Ethernet preamble - . Add 14 octets for the ETH header that contains destination MAC (6), source MAC (6) and - Ethernet type (2) - . Add 2 octets to pad the ETH header to align to 8 byte word boundary - . Add 4 octets for CRC - . Add 12 octets (c_network_eth_gap_len) for Ethernet gap size between packets - = 8 + 14 + 2 + 4 + 12 = 40 octets - -- The DP packet overhead consists of (dp_packet_enc_crc / dp_packet_dec_crc): - . Add 4 octects for CHAN (32b) - . Add 8 octects for Sync & BSN (64b) - . Add 4 octects for ERR (32b) - . Add 4 octects for CRC (32b) - = 4 + 8 + 4 + 4 = 20 octets - -Design decision: The DP/ETH packet overhead is P_overhead = 60 octets. - - -Use one packet type per ring lane: -This avoids having to multiplex different packet types onto a single lane. Still the Ethernet type -can be used to fill in the packet type to more easily identify data on different lanes of the ring. +******************************************************************************* +* Nof lanes +******************************************************************************* +DONE: How many transceivers are needed for the ring? The ring uses 4 of the 12 available transceivers, to match the QSFP cable link that is needed to connect the ring between UniBoard2. @@ -293,11 +63,6 @@ data loads are: - << 10Gbps transient buffer data -Link monitoring: -The link should be monitored during normal operation and to avoid the need to define and control a -test packet (e.g. like ping). The link monitoring should directly identify the source of a error -(e.g. tx node, link, rx node). -Design decision: Use DP/ETH packets to monitor the link quality. 
******************************************************************************* @@ -309,99 +74,8 @@ OSI 6 Presentation layer: OSI 7 Application layer: -The ring function has the following sub functions: -- Receive packets from ring (and remove CRC field) -- Discard incorrect packets (based on CRC) -- Pass on transit packets (Destination MAC > PN index for forward ring, MAC < PN index for backward - ring) -- Decode packets (get packet from ring for internal use) -- Encode packets (put internal packet onto ring) -- Multiplex local and transit packets -- Transmit packets onto ring -- Monitor Rx and Tx packets -- Align packets for processing (use filler data on inputs with lost packets) - - -Ring access and transport schemes: - -- 1) ring combiner scheme: start node sends packet to end node, intermediate nodes modify the - packet (= combine local with remote). -- 2a) ring endcast scheme: each node starts sending its packets to an end node (= end cast), - intermediate nodes pass on the packet -- 2b) ring multicast scheme: each node starts sending its packets to an end node, intermediate - nodes pass on the packet and use the packet (= multi cast) - -If both scheme 1 and 2 are suitable, then scheme 1 typically yields a larger payload, because it -reserves slots for all nodes, whereas the payload for scheme 2 only contains data from one node. -Scheme 1 and 2b are useful if the transit nodes also use or modify the packet data. The multiple -hops are then used to multi cast the data. Scheme 2a is suitable for packet transport from start -to end node, whereby transit nodes only pass on the packet. - -With scheme 1 each node has a two input BSN aligner that needs to buffer a large packet. With -scheme 2 the end node has a N input BSN aligner that needs to align N small packets, Even -though scheme 2 only uses the BSN aligner at the en node, it is there at all nodes, because all -nodes run the same firmware image. 
Therefore the resource usage of the BSN aligner will -typically not differ much for scheme 1 or 2. - -If one hop fails in scheme 1 then there is no offload. If one hop fails in scheme 2a then there -is still offload from subsequent hops. - -For the beamformer beamlets scheme 1 is most suitable. The start node prepares the packet with -the initial beamlet sums. The subsequent nodes add there local beamlet sum to the packet -beamlet sums and then pass on the packet. - -For the subband correlator both scheme 1 and scheme 2b are suitable. For scheme 1 the start node -creates a packet with slots for all nodes and fills in its own slot with its crosslets. Scheme 1 -was used in LOFAR 1.0. The subsequent nodes fill in their slots with their crosslets and also -use the packets to correlate the remote crosslets with their local crosslets. With scheme 2b -each node creates a packet with its own crosslets and sends it to N/2 nodes further. The -intermediate node pass on or remove the packets and use the packets to correlate the remote -crosslets with their local crosslets. The disadvantage of scheme 1 is that it requries a -dedicated start node that initiates the aggregate packet. With scheme 2b each node acts as start -node for its own packet. Intermediate nodes use the remote packets for correlation and pass -them on. The final destination node removes the packet. - -For the subband offload both scheme 1 and scheme 2a are suitable. For scheme 1 the start node -creates a packet with slots for all nodes and fills in its own slot with its subbands. The -subsequent nodes fill in their slots with their subbands. With scheme 2a each node creates a -packet with its own subbands and sends it to the output end node. The other nodes only pass on -the remote packets. - -For transient buffer read out scheme 2a is most suitable to gather the read out data from each -node at the output end node. - - -Ring access directions: -The ring can be used in both directions. 
The forward direction is e.g. from PN0 to 15, the -backward direction is e.g. from PN 15 to 0 for N = 16 nodes. -All schemes can be used in two directions for the same type of data transport. In one direction -the maximum number of hops between start and end node is N-1, while by using both directions the -maximum number of hops between start and end node is N/2. If the data is used on all intermediate -nodes, then there is no advantage to use the ring in both directions. If the data is only passed -along by intermediate nodes, then the link capacity is used about a factor two more efficiently -by sending data in both directions. Disadvantages of using the ring in both directions for the -same type of data are that each node needs to decide which direction to use, that the data arrives -from both directions at the end node, and that it is somewhat more difficult to understand and -diagnose. - -Design decision : Therefore choose to use the ring in only one direction per link. - - -Use one link per packet type: -For scheme 2 use only one link for all source nodes, so do not let different source nodes use -different links. For N/2 = 8 or N = 16 the number of links would become too large. By using one -link for all sources, increasing the processing becomes a matter of using and instantiating more -links. - - -Remote and local data alignment: -In APERTIF the data arrived from >= 2 remote streams. With the LOFAR ring there is always local -data that arrives first and needs to be aligned with only one remote data stream. The local data -needs to be buffered until the remote data from the farthest PN has arrived. The latency on the -ring is about 1 packet per transit hop, due to the store-and-forward. The first hop has negligible -latency. Hence with H hops the local data buffer size needs to be (H-1) * local data size. - +DONE: Ring data transport schemes: - beamlets on ring: l --> r+l --> r+l --> ... --> r+l . 
on each node align two inputs: l,r @@ -414,7 +88,8 @@ Ring data transport schemes: and count unflagged blocks to know the number of active blocks per integration sync interval. - subbands on ring: l, rl, rrl, rrrl, ..., rrrrrrrrrrrrrrrl - . on final node align all l,(N-1)*r inputs + . on final node align all l,(N-1)*r inputs to allow reordering. If reordering is not needed, + then the local data and the remote data from the ring can be offloaded directly without aligning. . output filler data if remote got lost, to preserve nominal output rate to AARTFAAC - transient buffer readout: l, r, r, ..., r @@ -422,19 +97,49 @@ Ring data transport schemes: +DONE: +Ring addressing +At the destination node the packets are allocated to BSN aligner inputs based on the rn index of the source +node. For the combiner scheme 1 the BSN aligner has N_channel = 2 inputs, whereby the local input is connected to +input 0 and the remote input is connected to input 1. +For the multi cast scheme 3 the BSN aligner has N_channel = N_cast inputs. The local data is connected to input +channel 0 of the BSN aligner. The remote inputs are connected to the channel +that corresponds to the number of hops that they have travelled. + +1) The local packet is always allocated to input 0 of the BSN aligner.
+Positive direction (dir > 0): + +RN 0 1 2 3 4 5 6 dest align at rx rn - if dir>0: if < 0: + rn dest rn dest rn * -1 + N_pn + + d s-- 0 5 6 0 5 6 0 -5 -6 0 2 1 0 + --d s 1 6 0 1 5 -1 0 -5 1 0 2 1 0 + s---d 2 0 1 2 -2 -1 0 2 1 0 2 1 0 + s---d 3 1 2 3 -2 -1 0 2 1 0 2 1 0 + s---d 4 2 3 4 -2 -1 0 2 1 0 2 1 0 + s---d 5 3 4 5 -2 -1 0 2 1 0 2 1 0 + s---d 6 4 5 6 -2 -1 0 2 1 0 2 1 0 + +Negative direction (dir < 0): + +RN 0 1 2 3 4 5 6 dest align at subtract if dir>0: of < 0: + rn dest rn dest rn * -1 + N_pn + + d---s 0 0 1 2 0 1 2 0 1 2 + d---s 1 1 2 3 0 1 2 0 1 2 + d---s 2 2 3 4 0 1 2 0 1 2 + d---s 3 3 4 5 0 1 2 0 1 2 + d---s 4 4 5 6 0 1 2 0 1 2 + s d-- 5 5 6 0 0 1 -5 0 1 2 + --s d 6 6 0 1 0 -6 -5 0 1 2 + + + ******************************************************************************* * Beamformer ******************************************************************************* -What is the beamlet packet size? -The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At -the end PN the final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate -beamlet sum has W_beamlet = 18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976 -beamlets per packet. The payload size is N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte = -2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet size is 60 + 4392 = 4452 octets. With -f_sub = 195312.5 Hz the data rate is 4452 * 195312.5 * 8 = 6.95625 Gbps < L_lane = 7.8125, so it -fits on a 10GbE lane. Packet decoding and encoding: The start node encodes the packet and the end node decodes the packet. The intermediate nodes could @@ -446,7 +151,7 @@ intermediate nodes can reuse the encoding function of the start node and the dec end node, so no extra logic is needed. 
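The beamlet sum packet budget on the ring can be re-checked with a short Python sketch. It only restates the arithmetic from the text (N_pol = 2, S_sub_bf = 488, W_beamlet_sum = 18 bit, 60 octets packet overhead, f_sub = 195312.5 Hz, L_lane = 7.8125 Gbps). The constant and function names are hypothetical and are not part of the firmware.

```python
# Sketch only: beamlet sum packet size and lane load, using values quoted in the text.
N_POL = 2             # polarizations
S_SUB_BF = 488        # beamlets per polarization
N_COMPLEX = 2         # real and imaginary part
W_BEAMLET_SUM = 18    # bits per part of the intermediate beamlet sum
W_BYTE = 8            # bits per octet
P_OVERHEAD = 60       # octets of packet overhead (assumed fixed, as in "60 + 4392")
F_SUB = 195312.5      # subband rate in Hz
L_LANE = 7.8125e9     # lane load capacity in bit/s


def beamlet_packet_budget():
    """Return (payload octets, packet octets, data rate in bit/s)."""
    payload_bits = N_POL * S_SUB_BF * N_COMPLEX * W_BEAMLET_SUM
    payload_octets = payload_bits // W_BYTE
    packet_octets = P_OVERHEAD + payload_octets
    rate_bps = packet_octets * W_BYTE * F_SUB
    return payload_octets, packet_octets, rate_bps


if __name__ == "__main__":
    payload, packet, rate = beamlet_packet_budget()
    print("payload octets:", payload)
    print("packet octets :", packet)
    print("rate in Gbps  :", rate / 1e9)
    print("fits on lane  :", rate < L_LANE)
```

With these values the payload is 4392 octets, the packet is 4452 octets and the load is 6.95625 Gbps, which is below L_lane.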
Ring adder payload processing: -The full band station beam has S_sub_bf = 488 beamlets per polarization, so in total there are +The beamset has S_sub_bf = 488 beamlets per polarization, so in total there are N_pol * S_sub_bf = 2 * 488 = 976 complex beamlets per subband period of N_clk = 1024 cycles. The ring adder adds the local beamlet sum to the received beamlet sum and passes on the result. The beamlet sum is received as a packet with 64 bit packed data at 156.25 MHz (64 * 156.25M = @@ -474,7 +179,7 @@ packet only needs to be buffered once. If both CRC are correct then, the Rx payl from the store-and-forward buffer, else it is discarded. The released payload is then repacked to obtain the remote beamlets. The remote beamlets are then aligned to the local beamlets and summed. The summed beamlets are repacked to 64b data and then DP/ETH encoded. The DP encoding -adds the DP CRC and the 10GbE MAC will add the ETH CRC. +adds the DP CRC and the 10GbE MAC adds the ETH CRC. In the DP domain, at 200 * R_os MHz, there are N_clk = 1024 cycles available to process the packet. The clock domain crossing from Rx 156M to DP 200M causes gaps in the data, but these gaps @@ -541,7 +246,7 @@ beam then differs, dependent on where on the ring the packet got lost. Ring modes: -- off +- off --> - local - remote - combine @@ -574,14 +279,14 @@ The beamformer function has the following sub functions: ******************************************************************************* With transport scheme 1 crosslets from different source nodes are combined into one packet. -Scheme 2b packs only local crosslets into a packet. Compared to scheme 1, scheme 2b: +Scheme 3 packs only local crosslets into a packet. 
Compared to scheme 1, scheme 3: - treats the local crosslets and remote crosslets independently - has small payload and thus more packet overhead, but the packet load still fits on a lane - has small payload that can be enlarged by transporting more local crosslets, to support a subband correlator with N_crosslets > 1 per integration interval. Design decision: - Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops, + Use transport scheme 3 with N/2 hops where every node sends its local crosslets N/2 hops, because it is more flexible to have only local crosslets per packet. @@ -873,7 +578,7 @@ The subbands are gathered at the output node via the ring. Using the ring avoids a 10GbE switch. Such a switch would need > 16 + 16 ports to support LBA + international HBA and some output ports. If the data is gathered, then it can as well be reordered to combine all S signal inputs in a single payload. The subbands can be sent to the output node via the ring using -either scheme 1 or scheme 2a: +either scheme 1 or scheme 2: Select N = 48 from N_lba = 96 antennas: @@ -893,7 +598,7 @@ be selected from the N_lba = 96 antennas at different stages within SDP: - Transport all SI to the end node and select there. The advantage is that an arbitrary selection can be done at the end node. - .
With transport scheme 2a each node only sends the selected inputs. For arbitrary selection + . With transport scheme 2 each node only sends the selected inputs. For arbitrary selection this yields payload sizes that depend on the selection, which is awkward. The advantage of scheme 1 is that the output payload is already formed by the selection at each -node. With scheme 2a a multiplexer is needed to combine the paylaods from all nodes into the +node. With scheme 2 a multiplexer is needed to combine the payloads from all nodes into the output packet. If in scheme 1 a packet gets lost, then all subbands from the remote nodes that were already passed are lost. If in scheme 2 a packet gets lost, then only the subbands from the node that sent that packet are lost. Design decision: -- Assume the SO only transports the selected subbands and uses scheme 2a. The selection is made +- Assume the SO only transports the selected subbands and uses scheme 2. The selection is made by letting each node either send all S_pn = 12 inputs or none. Hence only N/2 = 8 nodes send subbands, the other N/2 nodes remain quiet. The selected nodes are identified via the channel field, e.g. if node 0, 3, 4, 5, 6, 7, 8, 11 are selected for output, then they get @@ -928,7 +633,7 @@ Design decision: For LOFAR 2.0 the number of LBA doubles to S_lba = 192, but AARTFAAC2.0 assumes that still S = 96 will offload subbands. Assume that only the selected S = 96 signal inputs are transported via the -ring using scheme 2a. For the LBA the ring in SDP connects N = S_lba / S_pn = 192 / 12 = 16 nodes, +ring using scheme 2. For the LBA the ring in SDP connects N = S_lba / S_pn = 192 / 12 = 16 nodes, so N-1 hops. Assume the subbands are sent in one direction along the ring. The subband data load on the last hop is then (N-1)/N * 19.2G = 15/16 * 19.2G = 18.0 Gbps, excluding packet overhead.
Given a lane load capacity of L_lane = 7.8125 Gbps, this implies that the subband offload requires @@ -946,7 +651,7 @@ inserts its local subbands at the appropriate offset in the payload. The packet P_packet * W_byte * f_sub * R_os = 4288 * 8 * 195312.5 * 1.28 = 8.576 Gbps. Hence transporting S_sub_lane = 22 subbands for S = 96 SI fits on a 10GbE lane. -If the subband data is send in separate packets for each PN using scheme 2a, then the payload size +If the subband data is send in separate packets for each PN using scheme 2, then the payload size for S_sub_lane = 24 subbands per lane, S_pn = 12 signal inputs per PN and W_subband_so = 8 bit becomes S_pn * S_sub_lane * N_complex * W_subband_so / W_byte = 12 * 24 * 2 * 8/8 = 576 octets. The packet size is P_packet = 60 + 576 = 636 octets. The packet overhead is P_packet / P_payload @@ -958,21 +663,21 @@ instead S_sub_lane = 22 subbands yields P_payload = 12 * 22 * 2 * 8/8 = 528, P_p = 588 and an aggregate load of 16/2 * 588 * 8 * 195312.5 * 1.28 = 9.408 Gbps, which fits on a 10GbE lane. -Both scheme 1 and scheme 2a can transport S_sub_lane = 22 subbands per 10GbE lane. The difference -is that scheme 1 has a load of 8.576 Gbps on all hops, whereas for scheme 2a the load increases +Both scheme 1 and scheme 2 can transport S_sub_lane = 22 subbands per 10GbE lane. The difference +is that scheme 1 has a load of 8.576 Gbps on all hops, whereas for scheme 2 the load increases with every hop and has a maximum of 9.408 Gbps on the last hop. With scheme 1 each node has to put its local subbands at the right location in the packet. In this way the end node only needs to output the payload, because the data is already in the subband offload payload format. With -scheme 2a all nodes just send their local data and pass on the transit data. At the end node a +scheme 2 all nodes just send their local data and pass on the transit data. 
At the end node a demultiplexer and BSN aligner are needed to align the packets from all N/2 = 8 nodes. After that the end node needs to reorder the data from these N/2 = 8 input payloads into the subband offload payload format. This functionality in the end node is similar to the rsp_terminal function on -UniBoard1 for AARTFAAC. Scheme 1 is specific to the ring, scheme 2a would also work if the +UniBoard1 for AARTFAAC. Scheme 1 is specific to the ring, scheme 2 would also work if the subband data is sent to the end node via a switch (or via URI like with RSP). -With scheme 2a the ring could be used in both directions, but this does not improve the capacity -of the ring. With scheme 2a in one direction the packets travel 1+2+3+...+(16-1) = 120 hops. -With scheme 2a in both directions the packets travel 1+2+3+4+5+6+7+8 = 36 hops left and +With scheme 2 the ring could be used in both directions, but this does not improve the capacity +of the ring. With scheme 2 in one direction the packets travel 1+2+3+...+(16-1) = 120 hops. +With scheme 2 in both directions the packets travel 1+2+3+4+5+6+7+8 = 36 hops left and 1+2+3+4+5+6+7 = 28 hops right, so total 64 hops. For the transport load on the ring as a whole using both directions is a factor 120/64 = 1.875 more efficient. However at the end node the ring still has to transfer the same load of N/2 = 8 packets. Therefore at the end node the @@ -999,7 +704,7 @@ Design decision: stage of user application) - Select subbands in groups of S_pn SI per node from N/2 = 8 nodes to have S = 96 signal inputs. The SI from the other nodes are then not used and not transported. -- Use ring transport scheme 2a and in one direction (simpler control than using both +- Use ring transport scheme 2 and in one direction (simpler control than using both directions) - On the ring 3 lanes are sufficient to transport 22 subbands per lane.
At the output 3 lanes are sufficient, but use 4 lanes and output 16 subbands per lane, to reduce the load per diff --git a/applications/lofar2/doc/prestudy/station2_sdp_srs.txt b/applications/lofar2/doc/prestudy/station2_sdp_srs.txt index d746394cc0..5b56a3e0bd 100755 --- a/applications/lofar2/doc/prestudy/station2_sdp_srs.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_srs.txt @@ -116,6 +116,8 @@ LOFAR2-4301 Start station beamlet output --> Output beamlets LOFAR2-4300 Stop station beamlet output --> Output beamlets LOFAR2-3220 Examine data at each processing step --> DB +LOFAR2-4392 Station coordinate systems + Transient buffer LOFAR2-3420 Transient buffer --> Buffer LOFAR2-2305 Buffer length >= 2.5 s --> Buffer @@ -139,4 +141,4 @@ LOFAR2-3144 Transient detection mode LOFAR2-2310 Send trigger to TM --> Trigger to SC Subband offload -None \ No newline at end of file +None -- GitLab