From deba712170310be1f7135acf362898e10309bcd1 Mon Sep 17 00:00:00 2001 From: Eric Kooistra <kooistra@astron.nl> Date: Fri, 22 Nov 2019 15:37:33 +0100 Subject: [PATCH] Updated BF and XC part of station2_sdp_ring.txt. --- .../lofar2/doc/prestudy/station2_sdp_dsp.txt | 5 +- .../prestudy/station2_sdp_hdl_components.txt | 10 +- .../lofar2/doc/prestudy/station2_sdp_ring.txt | 838 ++++++++++-------- .../doc/prestudy/station2_to_do_erko.txt | 9 +- 4 files changed, 489 insertions(+), 373 deletions(-) diff --git a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt index 993d1734ff..31bf474f7e 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt @@ -44,11 +44,14 @@ M&C: The (x+y) could be implemented as first (x+y) and then *w, or as first weight and then add. + ******************************************************************************* * Subband correlator ******************************************************************************* - +First the local crosslets are correlated with themselves and then +the local crosslets are kept in a barrel shifter, such that they can also be correlated with the +remote crosslets that arrive in the packets. ******************************************************************************* diff --git a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt index c82daa7080..cc41dda277 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt @@ -44,7 +44,7 @@ - RSP RAD frame: . uses: FSI, FSN, DATA, CRC. . The FSN is 16 bit but the MSbit is used for the sync. The other 15 bits count blocks. - . After Rx frame the FSI is stripped and the CRC is replace by a BRC. + . 
After Rx frame the FSI is stripped and the CRC is replaced by a boolean check (BRC). - CRC Error checking: The CRC is a 32 bit number, so the chance that the CRC results in a false positive is 1/2**32 ~= 2.3e-10 or 1 @@ -123,7 +123,7 @@ to ensure that all inputs have the same 64 bit sync and BSN. ******************************************************************************* -* BSN aligner +* BSN aligner dp_bsn_align_v2 ******************************************************************************* Assumptions: @@ -441,6 +441,12 @@ Design options: - flush per packet or flush until empty? - flush per input per input or flush all inputs? - flush by reading, or by reset or by moving a Rd pointer + A FIFO can be flushed by resetting it, but this requires careful control to ensure that the reset is + noticed in both clock domains, and that the reset is applied in between input packets to avoid that + only a tail of a packet gets into a FIFO. Therefore in LOFAR 1.0 and APERTIF a FIFO is flushed by + reading the packets from it until it is empty. This scheme also allows flushing per packet. The + disadvantage of reading the packets and then discarding them is that it takes as long as reading at full + speed. - Use packet count instead of FIFO full indicator - can we do without flushing the FIFO? Not if we need to realign. - If multiple packets on a remote input get lost, then the other inputs fill up if there is no timeout. 
Flush diff --git a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt index b484c32f0e..90e1fe9030 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt @@ -2,60 +2,118 @@ Detailed design: RING ******************************************************************************* -* Data format +* Data rate ******************************************************************************* Support for oversampled subband filterbank -The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are 32/28 = 1.142, -32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28. - -Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled subbands it will run at -R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz. In this way if the processing fits for the -critically sampled subbands, then it will also fit for the oversampled subbands. - -The IO data rate on the ring increases with the oversampling factor R_os. For oversampled data the ring 10GbE has -the full 10 Gbps capacity and for critically sampled data the effective ring capacity becomes 10G / R_os = -10G / 1.28 = 7.8125 Gbps. The aim is to be able to replace the critically sampled filterbank by an oversampled -filterbank without having to change other parts in the design. Therefore assume that the ring capacity for the -critically sampled data is restricted to 7.8125 Gbps. The alternative to use full ring capacity for critically -sampled data and then support less (S_sub_bf / R_os = 488 / 1.28) beamlets for oversampled data is not compliant -with the requirement of S_sub_bf = 488. +The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are +32/28 = 1.142, 32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28. 
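As a quick check of the oversampling factors listed above, the following Python sketch evaluates R_os = 32 / k and shows which factors satisfy the assumed limit R_os <= 1.28. It is an illustration only and not part of the patched files; the names R_OS_MAX and oversampling_factors are made up for this example.

```python
# Illustrative check of the oversampling factors R_os = 32 / k quoted above.
# The names used here are hypothetical and do not come from the design notes.
R_OS_MAX = 1.28  # assumed upper limit on R_os, taken from the text


def oversampling_factors(ks=(28, 27, 26, 25, 24)):
    """Return a dict that maps k to R_os = 32 / k."""
    return {k: 32 / k for k in ks}


factors = oversampling_factors()

# Keep only the factors that respect the assumed limit R_os <= 1.28.
within_limit = {k: r for k, r in factors.items() if r <= R_OS_MAX}

for k, r in factors.items():
    print(f"32/{k} = {r:.3f}, within limit: {r <= R_OS_MAX}")
```

Under this assumption only 32/24 = 1.333 falls outside the supported range, so R_os = 32/25 = 1.28 is the largest factor that has to be accommodated.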
+ +Processing capacity per subband period: +Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled +subbands it will run at R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz. +This means that the processing has N_fft = 1024 clock cycles available per subband period T_sub, +independent of R_os. In this way if the processing for the critically sampled subbands fits +within N_clk = N_fft = 1024 clock cycles, then it will also fit for the oversampled subbands. + +IO capacity per 10GbE lane: +The IO data rate on the ring increases with the oversampling factor R_os. For oversampled data +the ring 10GbE has the full 10 Gbps capacity and for critically sampled data the effective +ring capacity per lane becomes L_lane = 10G / R_os = 10G / 1.28 = 7.8125 Gbps. The aim is to +be able to replace the critically sampled filterbank by an oversampled filterbank without +having to change other parts in the design. Therefore assume that the ring capacity for the +critically sampled data is restricted to L_lane < 7.8125 Gbps. + +Note: +The alternative to use full ring capacity for critically sampled data and then support fewer +(S_sub_bf / R_os = 488 / 1.28 = 381, so about 22 % fewer) beamlets for oversampled data is not +compliant with the requirement of S_sub_bf = 488. Design descision: Support S_sub_bf = 488 also for maximum R_os = 1.28. W_beamlet_sum -LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will only support 8 bit. -Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit beamlet mode, which is sufficient to -detect overflow. Using W_beamlet_sum = 18 bit also fits the input data width of the FPGA hard core multipliers in -the BST. Given that the signal input level is 4 bit the beamformer could round 2 LSbit to effectively achieve -20 bit dynamic range, even for S = 1 signal input. 
However the same effect can also be achieved by reducing the -beamlet weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically sampled -beamlet data and the oversampled beamlet data, to avoid differences in the design. -The beamlet sum that is transported across the ring needs to fit on a 10GbE link. With S_sub_bf = 488 and R_os <= -1.28 the data rate for one full band station beam is N_pol * S_sub_bf * f_sub * R_os * N_complex * W_beamlet_sum -= 2 * 488 * 195312.5 * 1.28 * 2 * 18 = 8.784 Gbps. This leaves about 13.8 % margin for packet overhead, which is -sufficient. Using W_beamlet_sum = 18 bit fits the input data width of the FPGA hard core multipliers and also -provides sufficent dynamic range to scale the final beamlet sum to W_beamlet = 8 bit for output. - -Design descision: W_beamlet_sum = 18 bit for both critically sampled beamlet and oversampled beamlets +LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will +only support 8 bit. Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit +beamlet mode, which is sufficient to detect overflow. Using W_beamlet_sum = 18 bit also fits the +input data width of the FPGA hard core multipliers in the BST. Given that the SDP signal input +level is 4 bit the beamformer could round 2 LSbit to effectively achieve 20 bit dynamic range, +even for S = 1 signal input. However the same effect can also be achieved by reducing the beamlet +weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically +sampled beamlet data and the oversampled beamlet data, to avoid differences in the design. The +beamlet sum that is transported across the ring needs to fit on a 10GbE lane. With S_sub_bf = 488 +the data rate for one full band station beam is N_pol * S_sub_bf * f_sub * N_complex * +W_beamlet_sum = 2 * 488 * 195312.5 * 2 * 18 = 6.8625 Gbps. 
Using L_lane = 7.8125 Gbps this leaves +about 1 - 6.8625 / 7.8125 = 12 % margin for packet overhead, which is sufficient. + + +Design decision: + Use W_beamlet_sum = 18 bit for both critically sampled beamlet and oversampled beamlets. + Using W_beamlet_sum = 18 bit fits on one 10GbE lane on the ring, fits the input data width + of the FPGA hard core multipliers and also provides sufficient dynamic range to scale the final + beamlet sum to W_beamlet = 8 bit for output. + ******************************************************************************* -* Ring function +* Ring links: ******************************************************************************* -Ring transceiver medium access (MAC): -Use Ethernet per transceiver link.The Ethernet MAC provides link establishment, so it uses a full duplex transceiver. The -Ethernet packet header contains destination MAC address, source MAC address and Ethernet type. The Ethernet packet tail -contains a CRC. The CRC provides data error detection. No need to use UDP/IP and ARP, because the links in the ring are -point to point and will not be used in a network. The Ethernet fields can be used as: - - Destination MAC = destination PN index - - Source MAC = source PN index - - Ethernet type = packet type +OSI 1 Physical layer: Transceivers + +OSI 2 Data link layer: +Use Ethernet per transceiver link. The Ethernet MAC provides link establishment, so it uses a full +duplex transceiver. The Ethernet packet header contains destination MAC address, source MAC +address and Ethernet type. The Ethernet packet tail contains a CRC. The CRC provides data error +detection. There is no need to use UDP/IP and ARP, because the links in the ring are point to +point and will not be used in a network. 
The Ethernet fields can be used as: + +- Destination MAC = destination PN index +- Source MAC = source PN index +- Ethernet type = packet type + Design decision: Use Ethernet for the ring transceiver links -Ring application packet types: + +Use 10GbE or 40GbE: +From the low-latency Ethernet core user guides it follows that the Ethernet core with statistics +registers uses: + +- 10GbE core : 4300 FF, 4 M9K +- 40GbE core : 21200 FF, 13 M20K + +The synthesis fitter results from Apertif BF and XC show that the tech_eth_10g takes about 5500 FF +and 4 (BF) or 7 (XC) M9K. The BF MAC has no statistics, the XC MAC does have statistics. +Hence the 40GbE core is about a factor 4 larger than the 10GbE core, so from a resource usage point +of view it does not matter whether we use 4 x 10GbE or 1 x 40GbE. The advantage of 40GbE is +that it can fit data rates > 10Gbps per data type stream. The advantage of using 10GbE is that we +can use one link per data type stream and thereby avoid having to multiplex different data streams +onto the same 40GbE link. However some multiplexing of local packets and remote transit packets can +also be needed. UniBoard2 has been tested with 10GbE but not yet with 40GbE. +The Arria10 on UniBoard2 has 1708800 FF so 1708800 / 182400 = 9.3 times more than the Stratix IV +on UniBoard1. On UniBoard2 one 10GbE interface uses maximum about 5500 / 1708800 = 0.32 % of the +FF and maximum about 7 / 2713 = 0.25% of the block RAM. In total there will be 4 x 10GbE for the +intra board ring, 4 x 10GbE for the inter board ring and 1 x 10GbE for external IO, so these will +take about 3% of the FF and block RAM resources. +The packet rate is f_sub = 195312.5 Hz. At 10GbE this means that the maximum packet size is +10e9 / 195312.5 / 8 = 6400 octets. For oversampled subbands the maximum packet size drops to about +6400 / 1.25 = 5120 octets. If the minimum packet size is e.g. 
4000 octets, then at 10GbE this +means that the link cannot be fully used, whereas at 40GbE multiple packets will still fit. The +maximum packet size for 10GbE also depends on the number of packets on the ring: +. With one packet on the ring the maximum packet size for 10GbE is 5120 (R_os = 1.25) octets, +. With N = 16 nodes and all nodes sending to the same end node the maximum packet size is 5120/16 + = 320 (R_os = 1.25) octets, +. If the packet only needs to travel N/2 nodes then the maximum packet size is 5120/8 = 640 + (R_os = 1.25) octets. + +Design decision: Assume the ring will use 4 x 10GbE, because it is known technology and suitable. + + +Internally in the FPGA the 10GbE data on the ring interface is available as 64 bit data at +156.25 MHz (64 * 156.25M = 10G). + + +Ring application Ethernet packet types: The ring is used for the following application packet types: - 0x10FB for beamlets, @@ -63,136 +121,235 @@ The ring is used for the following application packet types: - 0x10FD for subband offload, - 0x10FE for transient buffer read out -The packet type information can be transported via the Ethernet type field or via an UDP port number. If each link -is only used for one kind of packet type, then the packet type is only used for information, because the PN -already knows the packet type. The packet type value is based on packet types that were defined in RSP, where -0x10FA was used to identify M&C data (0x10FA ~= LOFAR) and the other type values just increment the 0x10FA value. +The packet type information can be transported via the Ethernet type field or via a UDP port +number. If each lane is only used for one kind of packet type, then the packet type is only used +for information, because the PN already knows the packet type. The packet type value is based +on packet types that were defined in RSP, where 0x10FA was used to identify M&C data (0x10FA ~= +LOFAR) and the other type values just increment the 0x10FA value. 
+ Design decision: Transport application packet type via Ethernet type field for information -Use UDP/IP/ETH or only ETH on the ring: -We already have a UDP offload component that supports UDP/IP/ETH, but a similar component that only supports ETH is -easily derived from it. With an UDP the LOFAR packet type information can be transported via the UDP port field. -Using UDP/IP makes it easier to send the data to a PC for monitoring purposes, however it is also possible to sniff -raw Ethernet packets on a PC. Using a PC to verify the ring allows capturing large amounts of data. On an FPGA we -can use a data buffer to sniff the packets, but only a few. -The extra overhead of UDP = 8 octets and IP = 20, so 28 octets in total. The disadvantage of using UDP/IP is that -it adds some extra traffic overhead and uses some extra logic resources, but that could be acceptable. The -disadvantage of verifying the ring using a PC are: + +OSI 3 Network layer: Use ring + +Wormhole routing (or cut-through routing) or store-and-forward routing: +With worm hole routing a received packet or a received and modified packet is already +transmitted, while the tail of the packet is still being received. The advantage of wormhole +routing is that it minimizes the latency along the ring and therefore also local buffering to +align between local and remote data. The disadvantage of wormhole routing is that a CRC error +on the received packet needs to be propagated by forcing the CRC of the transmitted packet to be +wrong. This implies that all subsequent hops will show this CRC error. For link diagnoses this +is confusing, because the subsequent links did not cause the CRC error. With store-and-forward +routing a packet is first received entirely before it is passed on for transmit. This allows to +discard a received packet with a CRC error, but does increase the latency on the ring. 
Packets +with a CRC error cannot be allowed to enter the processing in the node, because any bit in the +packet may be corrupted, especially in the packet header, so no meaningful processing is +possible. + +Design decision: + For LOFAR 2.0 choose to use store-and-forward, because it allows discarding packets with CRC + errors when they occur and because there is sufficient internal block RAM to buffer the local + data for the worst case ring latency. + + +Only accept correct packets: +Discard all packets that have a CRC error. This also prevents that packets of wrong length enter +the internal processing. The Ethernet CRC is 32 bit, so it is very unlikely that a packet with +errors still has a correct CRC. With wormhole routing it was necessary to limit or extend a packet +to a known fixed length, because also packets with CRC error are passed on. With store-and-forward +routing the CRC provides sufficient protection to ensure that only correct packets enter the +application. + + +Ring latency: +The latency of 1 hop is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at +10Gbps is about 1.2 us and a jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for +packets >~ 300 octets the ring latency is dominated by the store-and-forward routing at each node. +The 10GbE Ethernet MAC uses 64 bit data. At 200 MHz this can achieve 64 * 0.2 = 12.8 Gbps. Hence +if the processing operates without data valid gaps, then the Ethernet transmit will not run empty +during a payload. Therefore it is not necessary to use a fill FIFO, which would add to the ring +latency. For a packet that travels the entire ring the latency is then about (N-1) * T_sub and +the corresponding FIFO depth to align the local data with this remote data is (N-1) * packet size. 
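The ring latency arithmetic above can be reproduced with a short Python sketch. It is an illustration only and not part of the patched files: the function and constant names are hypothetical, and it assumes that every store-and-forward hop costs one full packet transmit time, as the text does.

```python
# Illustrative ring latency arithmetic for store-and-forward routing.
# All names are hypothetical; the numbers are the ones quoted in the text.
LINK_RATE_BPS = 10e9     # 10GbE line rate in bit/s
F_SUB_HZ = 195312.5      # subband (packet) rate in Hz
HOP_LATENCY_S = 0.2e-6   # latency of one hop in seconds


def frame_time_s(octets, rate_bps=LINK_RATE_BPS):
    """Time to transmit a frame of the given size at the given line rate."""
    return octets * 8 / rate_bps


def ring_latency_s(n_nodes, packet_octets):
    """Approximate latency over N-1 hops, one packet time per hop."""
    return (n_nodes - 1) * frame_time_s(packet_octets)


def align_fifo_octets(n_nodes, packet_octets):
    """FIFO depth needed to hold local data until the farthest packet arrives."""
    return (n_nodes - 1) * packet_octets


# Packet size at which the transmit time equals the hop latency.
crossover_octets = HOP_LATENCY_S * LINK_RATE_BPS / 8

# Throughput of a 64 bit wide data path clocked at 200 MHz, in bit/s.
mac_rate_bps = 64 * 200e6

print(f"1500 octets: {frame_time_s(1500) * 1e6:.2f} us")
print(f"6400 octets: {frame_time_s(6400) * 1e6:.2f} us")
print(f"crossover  : {crossover_octets:.0f} octets")
print(f"N = 16     : {ring_latency_s(16, 6400) * 1e6:.1f} us, "
      f"{align_fifo_octets(16, 6400)} octets FIFO")
```

With these numbers the transmit time equals the 0.2 us hop latency at 250 octets, which is consistent with the statement that store-and-forward dominates for packets larger than about 300 octets.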
+ + +OSI 4 Transport layer: Use UDP/IP/ETH or only ETH on the ring: +We already have a UDP offload component that supports DP/UDP/IP/ETH, but a similar component that +only supports DP/ETH is easily derived from it. With UDP the LOFAR packet type information can +be transported via the UDP port field. Using UDP/IP makes it easier to send the data to a PC for +monitoring purposes, however it is also possible to sniff raw Ethernet packets on a PC. Using a +PC to verify the ring allows capturing large amounts of data. On an FPGA we can use a data buffer +to sniff the packets, but only a few. The extra overhead of UDP = 8 octets and IP = 20, so 28 +octets in total. The disadvantage of using UDP/IP is that it adds some extra traffic overhead and +uses some extra logic resources, but that could be acceptable. The disadvantages of verifying the +ring using a PC are: + - between FPGAs on the same UniBoard the ring can only be observed on the FPGA -- the ring will only connect FPGAs in the application, so using a PC is a side track that as such may cause extra - work. -Using UDP/IP does not make it possible to replace the ring by a switch without modifications, so changing from a -ring based design to a switch based design will still imply a redesign of the data transport scheme. -Design decision: Use raw Ethernet and verification on FPGA, because that fits the ring (especially between FPGAs - on UniBoard2) and avoids the extra overhead of UDP/IP. - -Ring application header: -The packet payload needs to have an application header to carry the timestamp and a stream identifier. This -information can be tranported via the DP packet header which has a BSN field and a channel field. The BSN is the -timestamp. The channel field can carry the source PN index and destination PN index. These PN indices are also -available in the ETH source and destination MAC addresses of ETH encoded packets, but they also need to be -available in ETH decoded packets. 
In ETH encoded packets the destination MAC address allow direct pass on of -transit packets on the ring, without having to ETH decode them. In ETH decoded packets the BSN and channel fields -can be passed along inside the encoded DP packet or in parallel with the decoded DP packet application data. The -channel information can be used to process the remote packets in parallel e.g. per source PN index. - - - -What is the ETH packet overhead? -The ETH packet overhead consists of: -. Add 8 octets (c_network_eth_preamble_len) for Ethernet preamble -. Add 14 octets for the ETH header that contains destination MAC (6), source MAC (6) and Ethernet type (2) -. Add 2 octets to pad the ETH header to align to 8 byte word boundary -. Add 4 octets for CRC -. Add 12 octets (c_network_eth_gap_len) for Ethernet gap size between packets - = 8 + 14 + 2 + 4 + 12 = 40 octets +- the ring will only connect FPGAs in the application, so using a PC is a side track that as such + may cause extra work. + +Using UDP/IP does not make it possible to replace the ring by a switch without modifications, so +changing from a ring based design to a switch based design will still imply a redesign of the +data transport scheme. + +Design decision: + Use raw ETH and verification on FPGA, because that fits the ring (especially between FPGAs on + UniBoard2) and avoids the extra overhead of UDP/IP. + + +Ring application header DP/ETH: +The packet payload needs to have an application header to carry the timestamp and a stream +identifier. This information can be transported via the DP packet header which has a BSN field and +a channel field. The BSN is the timestamp. The channel field can carry the source PN index and +destination PN index. These PN indices are also available in the ETH source and destination MAC +addresses of ETH encoded packets, but they also need to be available in ETH decoded packets. 
In +ETH encoded packets the destination MAC address allows direct pass on of transit packets on the +ring, without having to ETH decode them. In ETH decoded packets the BSN and channel fields can be +passed along inside the encoded DP packet or in parallel with the decoded DP packet application +data. The channel information can be used to process the remote packets in parallel e.g. per +source PN index. The channel information can also provide flagging information, to e.g. identify +filler packets. + +Design decision: + Use DP/ETH. Together the DP CRC and ETH CRC ensure that for the lifetime of LOFAR2.0 packets + with correct CRC will not have false positives. Use a bit in the channel field to indicate + filler packets. + + +What is the DP/ETH packet overhead? +- The ETH packet overhead consists of: + . Add 8 octets (c_network_eth_preamble_len) for Ethernet preamble + . Add 14 octets for the ETH header that contains destination MAC (6), source MAC (6) and + Ethernet type (2) + . Add 2 octets to pad the ETH header to align to 8 byte word boundary + . Add 4 octets for CRC + . Add 12 octets (c_network_eth_gap_len) for Ethernet gap size between packets + = 8 + 14 + 2 + 4 + 12 = 40 octets + +- The DP packet overhead consists of (dp_packet_enc_crc / dp_packet_dec_crc): + . Add 4 octets for CHAN (32b) + . Add 8 octets for Sync & BSN (64b) + . Add 4 octets for ERR (32b) + . Add 4 octets for CRC (32b) + = 4 + 8 + 4 + 4 = 20 octets + +Design decision: The DP/ETH packet overhead is P_overhead = 40 + 20 = 60 octets. + + +Use one packet type per ring lane: +This avoids having to multiplex different packet types onto a single lane. Still the Ethernet type +can be used to fill in the packet type to more easily identify data on different lanes of the ring. How many transceivers are needed for the ring? -There are four data types beamlets, crosslets, subband offload and transient buffer read out. 
The data loads are: -- 488 beamlets (R_os = 1 --> W_beamlet_sum = 24 bit, R_os = 1.25 --> W_beamlet_sum = 19.2 ~= 20 bit) +The ring uses 4 of the 12 available transceivers, to match the QSFP cable link that is needed to +connect the ring between UniBoard2. +There are four data types beamlets, crosslets, subband offload and transient buffer read out. The +data loads are: +- 488 beamlets (R_os = 1 --> W_beamlet_sum = 18 bit, R_os = 1.28) - ~10 crosslets (R_os = 1 --> 15 crosslets, R_os = 1.25 --> 12 crosslets) -- ~ subbands (R_os = 1 -- +- subbands (R_os = 1 +- << 10Gbps transient buffer data + + +Link monitoring: +The link should be monitored during normal operation, to avoid the need to define and control a +test packet (e.g. like ping). The link monitoring should directly identify the source of an error +(e.g. tx node, link, rx node). +Design decision: Use DP/ETH packets to monitor the link quality. + -Choose to transport one data type packet per 10GbE link direction. +******************************************************************************* +* Ring usage: +******************************************************************************* + +OSI 5 Session layer: +OSI 6 Presentation layer: +OSI 7 Application layer: -The ring can be used in both directions. The forward direction is e.g. from PN0 to 15, the backward direction is e.g. -from PN 15 to 0. The ring uses 4 of the 12 available transceivers, to match the QSFP cable link that is needed to connect -the ring between UniBoard2. 
The ring function has the following sub functions: - Receive packets from ring (and remove CRC field) - Discard incorrect packets (based on CRC) -- Pass on transit packets (Destination MAC > PN index for forward ring, MAC < PN index for backward ring) +- Pass on transit packets (Destination MAC > PN index for forward ring, MAC < PN index for backward + ring) - Decode packets (get packet from ring for internal use) - Encode packets (put internal packet onto ring) - Multiplex local and transit packets - Transmit packets onto ring +- Monitor Rx and Tx packets +- Align packets for processing (use filler data on inputs with lost packets) -Use 10GbE or 40GbE: -From the low-latency Ethernet core user guides it follows that the Ethernet core with statistics registers use: - 10GbE core : 4300 FF, 4 M9K - 40GbE core : 21200 FF, 13 M20K -The synhesis fitter results from Apertif BF and XC show that the tech_eth_10g takes about 5500 FF and 4 (BF) or 7 (XC) M9K. -The BF MAC has no statistics, the XC MAC does have statistics. -Hence the 40GbE core is about a factor 4 larger than the 10GbE core, so from a resource usage point of view it is does not matter -whether we use 4 x 10GbE or 1 x 40GbE. The advantage of 40GbE is that it can fit data rates > 10Gbps per data type stream. The -advantage of using 10GbE is that we can use one link per data type stream and thereby avoid having to multiplex different data -streams onto the same 40GbE link. However some multiplexing of local packets and remote transit packets can also be needed. -UniBoard2 has been tested with 10GbE but not yet with 40GbE. -The Arria10 on UniBoard2 has 1708800 FF so 1708800 / 182400 = 9.3 times more than the Stratix IV on UniBoard1. On UniBoard2 one -10GbE interface uses maximum about 5500 / 1708800 = 0.32 % of the FF and maximum about 7 / 2713 = 0.25% of the block RAM. 
-In total there will be 4 x 10GbE for the intra board ring, 4 x 10GbE for the inter board ring and 1 x 10GbE for external IO, so -these will take about 3% of the FF and block RAM resources. -The packet rate is f_sub = 195312.5 Hz. At 10GbE this means that the maximum packet size is 10e9/195312.5 = 6400 octets. For -oversampled subbands the maximum packet size drops to about 6400 / 1.25 = 5120 octets. If the minimum packet size is e.g. 4000 -octets, then at 10GbE this means that the link cannot be fully used, whereas at 40GbE multiple packets will still fit. The -maximum packet size for 10GbE also depends on the number of packets on the ring: -. With one packet on the ring the maximum packet size for 10GbE is 5120 (R_os = 1.25) octets, -. With N = 16 nodes and all nodes sending to the same end node the maximum packet size is 5120/16 = 320 (R_os = 1.25) octets, -. If the packet only needs to travel N/2 nodes then the maximum packet size is 5120/8 = 640 (R_os = 1.25) octets. -Design descision: Assume the ring will use 4 x 10GbE, because it is known technology and suitable. +Ring access schemes: +- 1) start node sends packet to end node, intermediate nodes modify the packet. +- 2a) each node starts sending its packets to an end node, intermediate nodes pass on the packet +- 2b) each node starts sending its packets to an end node, intermediate nodes pass on the packet + and use the packet (= multi cast) + +If both scheme 1 and 2 are suitable, then scheme 1 typically yields a larger payload, because it +reserves slots for all nodes, whereas the payload for scheme 2 only contains data from one node. +Scheme 1 and 2b are useful if the transit nodes also use or modify the packet data. The multiple +hops are then used to multi cast the data. Scheme 2a is suitable for packet transport from start +to end node, whereby transit nodes only pass on the packet. + +For the beamformer beamlets scheme 1 is most suitable. 
The start node prepares the packet with +the initial beamlet sums. The subsequent nodes add their local beamlet sum to the packet +beamlet sums and then pass on the packet. + +For the subband correlator both scheme 1 and scheme 2b are suitable. For scheme 1 the start node +creates a packet with slots for all nodes and fills in its own slot with its crosslets. Scheme 1 +was used in LOFAR 1.0. The subsequent nodes fill in their slots with their crosslets and also +use the packets to correlate the remote crosslets with their local crosslets. With scheme 2b +each node creates a packet with its own crosslets and sends it to N/2 nodes further. The +intermediate nodes pass on or remove the packets and use the packets to correlate the remote +crosslets with their local crosslets. The disadvantage of scheme 1 is that it requires a +dedicated start node that initiates the aggregate packet. With scheme 2b each node acts as start +node for its own packet. Intermediate nodes use the remote packets for correlation and pass +them on. The final destination node removes the packet. + +For the subband offload both scheme 1 and scheme 2a are suitable. For scheme 1 the start node +creates a packet with slots for all nodes and fills in its own slot with its subbands. The +subsequent nodes fill in their slots with their subbands. With scheme 2a each node creates a +packet with its own subbands and sends it to the output end node. The other nodes only pass on +the remote packets. + +For transient buffer read out scheme 2a is most suitable to gather the read out data from each +node at the output end node. -Use one packet type per ring link. -This avoids having to multiplex different packet types onto a single link. Still the Ethernet type can be used to fill -in the packet type to more easily identify data on different links of the ring. 
-Use application packets to monitore the link quality: -This allows monitoring the link during normal operation and avoids the need to define and control a test packet (e.g. -like ping). +Ring access directions: +The ring can be used in both directions. The forward direction is e.g. from PN0 to 15, the +backward direction is e.g. from PN 15 to 0 for N = 16 nodes. +All schemes can be used in two directions for the same type of data transport. In one direction +the maximum number of hops between start and end node is N-1, while by using both directions the +maximum number of hops between start and end node is N/2. If the data is used on all intermediate +nodes, then there is no advantage to use the ring in both directions. If the data is only passed +along by intermediate nodes, then the link capacity is used about a factor two more efficiently +by sending data in both directions. Disadvantages of using the ring in both directions for the +same type of data are that each node needs to decide which direction to use, that the data arrives +from both directions at the end node, and that it is somewhat more difficult to understand and +diagnose. -Wormhole routing or store-and-forward routing: -With worm hole routing a received packet or a received and modified packet is already transmitted, while the tail of -the packet is still being received. The advantage of wormhole routing is that it minimizes the latency along the ring -and therefore also local buffering to align between local and remote data. The disadvantage of wormhole routing is -that a CRC error on the received packet needs to be propagated by forcing the CRC of the transmitted packet to be -wrong. This implies that all subsequent hops will show this CRC error. For link diagnoses this is confusing, because -the subsequent links did not cause the CRC error. With store-and-forward routing a packet is first received entirely -before it is passed on for transmit. 
This allows to discard a received packet with a CRC error, but does increase the -latency on the ring. For LOFAR 2.0 choose to use store-and-forward, because it allows discarding packets with CRC -errors when they occur and because there is sufficient internal block RAM to buffer the local data for the worst case -ring latency. +Design decision : Therefore choose to use the ring in only one direction per link. -Ring latency: -The latency of 1 hop is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at 10Gbps is about 1.2 us -and a jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for packets >~ 300 octets the ring latency is -dominated by the store and forward routing at each node. The 10GbE Ethernet MAC uses 64 bit data. At 200 MHz this can -achieve 64 * 0.2 = 12.8 Gbps. Hence if the processing operates without data valid gaps, then the Ethernet transmit -will not run empty during a payload. Therefore it is not necessary to use a fill FIFO, which would add to the ring -latency. For a packet that travels the entire ring the latency is then about (N-1) * T_sub and the corresponding -FIFO depth to align the local data with this remote data is (N-1) * packet size. + +Use one link per packet type: +For scheme 2 use only one link for all source nodes, so do not let different source nodes use +different links. For N/2 = 8 or N = 16 the number of links would become too large. By using one +link for all sources, increasing the processing becomes a matter of using and instantiating more +links. -Only accept correct packets: -Discard all packets that have a CRC error. This also prevents that packets of wrong length enter the internal -processing. The Ethernet CRC error is 32 bit, so it is very unlikely that packet with errors still has a -correct CRC. With wormhole routing it was necessary to limit or extend a packet to a known fixed length, because -also packets with CRC error are passed on. 
With store-and-forward routing the CRC provides sufficient protection -to ensure that only correct packets enter the application. +Remote and local data alignment: +In APERTIF the data arrived from >= 2 remote streams. With the LOFAR ring there is always local +data that arrives first and needs to be aligned with only one remote data stream. The local data +needs to be buffered until the remote data from the farthest PN has arrived. The latency on the +ring is about 1 packet per transit hop, due to the store-and-forward. The first hop has negligible +latency. Hence with H hops the local data buffer size needs to be (H-1) * local data size. + Ring data transport schemes: - beamlets on ring: l --> r+l --> r+l --> ... --> r+l @@ -200,13 +357,10 @@ Ring data transport schemes: . output filler data if remote got lost, to preserve nominal output rate to CEP - crosslets on ring: rrrrrrrr,l --> rrrrrrrr,l --> ... --> rrrrrrrr,l - . on each node separately align N/2 pairs of inputs l,r, have one pair per XC cell - or - . on each node first align all inputs l,N/2*r, and then split into N/2 pairs of l,r to have one pair per XC cell - . discard output data if remote got lost, to count number of active blocks per integration sync interval - or - output filler data if remote got lost, and use zero to not disturb the intergation and count unflagged blocks - to know the number of active blocks per integration sync interval + . on each node first align all inputs l,N/2*r, and then split into N/2 pairs of l,r to have one + pair per XC cell (or on each node separately align N/2 pairs of inputs l,r, have one pair per + XC cell). Output filler data if remote got lost, and use zero to not disturb the integration + and count unflagged blocks to know the number of active blocks per integration sync interval. - subbands on ring: l, rl, rrl, rrrl, ..., rrrrrrrrrrrrrrrl . on final node align all l,(N-1)*r inputs @@ -216,56 +370,6 @@ Ring data transport schemes: . 
no align, readout from one node at a time -Ring access schemes: - -- 1) start node sends packet to end node, intermediate nodes modify the packet. -- 2a) each node starts sending its packets to an end node, intermediate nodes pass on the packet -- 2b) each node starts sending its packets to an end node, intermediate nodes pass on the packet and use the packet (= multi cast) - -If both scheme 1 and 2 are suitable than scheme 1 typically yields a larger payload, because it reserves slots for all -nodes, whereas the payload for scheme 2 only contains data from one node. Scheme 1 and 2b are useful if the transit nodes -also use or modify the packet data. Scheme 2a is suitable for packet transport from start to end node, whereby transit -nodes only pass on the packet. - -For the beam former beamlets scheme 1 is most suitable. The start node prepares the packet with the initial beamlet sums. -The subsequent nodes add there local beamlet sum to the packet beamlet sums and then pass on the packet. - -For the subband correlator both scheme 1 and scheme 2b are suitable. For scheme 1 the start node creates a packet with -slots for all nodes and fills in its own slot with its crosslets. Scheme 1 was used in LOFAR 1.0. The subsequent nodes fill in -their slots with their crosslets and also use the packets to correlate the remote crosslets with their local crosslets. -With scheme 2b each node creates a packet with its own crosslets and sends it to N/2 nodes further. The intermediate node -pass on the packets and use the packets to correlate the remote crosslets with their local crosslets. - -For the subband offload both scheme 1 and scheme 2a are suitable. For scheme 1 the start node creates a packet with slots for all -nodes and fills in its own slot with its subbands. The subsequent nodes fill in their slots with their subbands. With scheme 2a -each node creates a packet with its own subbands and sends it to the output end node. The other nodes only pass on the remote packets. 
- -For transient buffer read out scheme 2a is most suitable to gather the read out data from each node at the output end node. - - -Ring access directions: -All schemes can be used in two directions for the same type of data transport. In one direction the maximum number -of hops between start and end node is N-1, while by using both directions the maximum number of hops between start -and end node is N/2. If the data is used on all intermediate nodes, then there is no advantage to use the ring in -both directions. If the data is only passed along by intermediate nodes, then the link capacity is used -about a factor two more efficiently by sending data in both directions. Disadvantages of using the ring in both -directions for the same type of data are that each node needs to decide which direction to use, that the data -arrives from both directions at the end node, and that it is somewhat more difficult to understand and diagnose. -Design decision : Therefore choose to use the ring in only one direction per link. - -Use one link per packet type: -For scheme 2 use only one link for all source nodes, so do not let different source nodes use different links. For -N/2 = 8 or N = 16 the number of links would become too large. By using one link, increasing the processing becomes -a matter of using and instantiating more links. - - -Remote and local data alignment: -In APERTIF the data arrived from >= 2 remote streams. With the LOFAR ring there is always local data that arrives -first and needs to be aligned with only one remote data stream. The local data needs to be buffered until the remote -data from the farthest PN has arrived. The latency on the ring is about 1 packet per transit hop, due to the store -and forward. The first hop has negligible latency. Hence with H hops the local data buffer size needs to be (H-1) * -local data size. When the remote data arrive the local data is popped from the buffer. 
It the remote data has not -arrived in time, then the local data is popped from the buffer when the next local data is pushed into the buffer. ******************************************************************************* @@ -273,139 +377,211 @@ arrived in time, then the local data is popped from the buffer when the next loc ******************************************************************************* What is the beamlet packet size? -The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At the end PN the -final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate beamlet sum has W_beamlet = -18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976 beamlets per packet. The payload size is -N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte = 2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet -size is 40 + 4392 = 4432 octets. With f_sub = 195312.5 Hz and R_os = 1.28 the data rate is 4432 * 195312.5 * 1.28 -* 8 = 8.864 Gbps, which fits on a 10GbE link. +The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At +the end PN the final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate +beamlet sum has W_beamlet_sum = 18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976 +beamlets per packet. The payload size is N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte = +2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet size is 60 + 4392 = 4452 octets. With +f_sub = 195312.5 Hz the data rate is 4452 * 195312.5 * 8 = 6.95625 Gbps < L_lane = 7.8125 Gbps, so +it fits on a 10GbE lane. Packet decoding and encoding: -The start node encodes the packet and the end node decodes the packet. The intermediate nodes could operate on -the encoded packet, however the payload beamlets are packed into bytes and are not word aligned. 
Therefore the -intermediate nodes also need to decode the packet to be able to update the payload data, and then encode the -packet. The decode and encode function is available in any node, because all nodes run the same firmware image. -Therefore the decoding and encoding at intermediate nodes can reuse the encoding function of the start node and -the decode function of the end node, so no extra logic is needed. +The start node encodes the packet and the end node decodes the packet. The intermediate nodes could +operate on the encoded packet, however the payload beamlets are packed into bytes and are not word +aligned. Therefore the intermediate nodes also need to decode the packet to be able to update the +payload data, and then encode the packet again. The decode and encode function is available in any +node, because all nodes run the same firmware image. Therefore the decoding and encoding at +intermediate nodes can reuse the encoding function of the start node and the decode function of the +end node, so no extra logic is needed. Ring adder payload processing: -The station beam is a dual polarization beam and each beam has S_sub_bf = 488 beamlets, so in total there are -976 complex beamlets per subband period of N_fft = 1024 cycles @ 200 MHz. For an oversampled filterbank with -R_os = 4/3 there are N_fft / R_os = 768 cycles @ 200 * R_os MHz. Hence to be compatible with an oversampled -filter bank the beamformer cannot process all 976 beamlets in series, instead it has to apply ceil(R_os) = 2 -streams in parallel that each process 488 beamlets. Therefore to support the oversampled beamlets the paylaod -needs to be encoded from and decoded to two streams of beamlets: - - 0 : 0 2 4 ............. 974 - 1 : 1 3 5 ............. 975 - -The 10Gbps data on the ring interface is available as 32 bit data at 312.5 MHz (32 * 312.5M = 10G). 
+The full band station beam has S_sub_bf = 488 beamlets per polarization, so in total there are +N_pol * S_sub_bf = 2 * 488 = 976 complex beamlets per subband period of N_fft = 1024 cycles @ +200 MHz. For an oversampled filterbank with R_os > 1 the processing rate is increased to +200 * R_os MHz, so there are still N_fft = 1024 cycles available to process 976 beamlets. The ring +adder adds the local beamlet sum to the received beamlet sum and passes on the result. The beamlet +sum is received as a packet with 64 bit packed data at 156.25 MHz (64 * 156.25M = 10G). The 976 +beamlets fit in 976 * 18b * 2 / 64b = 549 64b words. The packet is processed at 200 * R_os MHz. + + . from 10GbE --> + . Rx packet 64b @ 156MHz --> Rx FIFO to dp_clk domain --> + . Rx packet 64b @ 200MHz --> DP/ETH decode to discard or extract payload of 549 words --> + . Rx payload 64b @ 200MHz --> repack 549 words to 976 beamlets --> + . Align remote and local beamlets --> + . Sum remote and local beamlets --> repack 976 beamlets to 549 words --> + . Tx payload 64b @ 200MHz --> DP/ETH encode to add header and tail --> + . Tx packet 64b @ 200MHz --> Tx FIFO to tx_clk domain --> + . Tx packet 64b @ 156MHz --> + . to 10GbE + + +? Does align belong to ring or to beamlet ring adder? +--> to beamlet ring adder: + - to avoid having an align input and output on the ring interface. + - implies that align monitor also belongs to beamlet ring adder +? Does sum belong to ring or to beamlet ring adder or to local beamformer? +--> to beamlet ring adder: + - it deserves a dedicated block, because it is part of the BF (so not of the ring) and it only + adds (so does not have BF weights like the local BF). + Local beamlet sums FIFO size: -The local subband data needs to be buffered until the beamlet sum arrives. The last node experiences the largest -latency, because then the beamlet sum has travelled N-1 hops, each adding about 5888 * 8 / 10G = 4.71 us. 
The -total latency for the LBA ring is (16 - 1) * 4.71 us = 70.6 us or about 14 T_sub. With some extra margin assume -that the last N-1 or N local beamlets need to be buffered. Per PN this yields a FIFO size of N_pol * S_sub_bf * -N * N_complex * W_subband = 2 * 488 * 16 * 2 * 18 = 562176 bit, which takes about 32 M20k block RAMs. +The local subband data needs to be buffered until the beamlet sum arrives. The size of the buffer +is determined by the last node, because then the beamlet sum has travelled N-1 hops. For each hop +the packet is delayed by: + - packet encoding + - packet transport over the ring +After each hop the packet is delayed by: + - store-and-forward to be able to check the CRC + - packet decoding + - packet processing +The store-and-forward causes a latency of one block period (T_sub) per hop and is the dominant +factor in the latency. During this latency N-1 local blocks need to be buffered. Assume that the +processing and transport delays are shorter than one block period, so buffering one extra local +block is sufficient to compensate for it. Per PN this yields a FIFO size of N_pol * S_sub_bf * N * +N_complex * W_subband = 2 * 488 * 16 * 2 * 18 = 562176 bit, which takes about 32 M20k block RAMs. Ring modes: - off - local - remote - combine -With dp_bsn_align all these modes are supported by enabling/disabling the corresponding inputs. - -FIFO flush: -A FIFO can be flushed by resetting it, but this requires careful control to ensure that the reset is noticed -in both clock domains, and that the reset is applied in between input packets to avoid that only a tail -of a packet gets into a FIFO. Therefore in LOFAR 1.0 and APERTIF a FIFO is flushed by reading the packets -from it until it is empty. This scheme also allows flushing per packet. The disadvantage of reading the -packets and the discard them, is that it takes as long as reading at full speed. 
- -Lost remote packet detection: -Local FIFO full: -The local FIFO needs to buffer the local data to be able to align with the remote data. The latency between -nodes depends on the number of hops. With N = 16 nodes and store and forward packet transport the maximum -latency will be < N * T_sub. To compensate for this latency the local FIFO needs to be able to store at most -about N local packets. If the FIFO runs full, then this is an indicator that remote packets got lost and -then the local FIFO needs to be flushed until it is empty. -Rx timeout: -The average packet rate on the ring is f_sub, so within T_sub there should arrive a new packet. If no packet -arrives within T_sub, then the local FIFO can flush one packet. In this way the local FIFO does not need to -be flushed until empty and less packets will get lost once the remote packets arrive again. Using Rx timeout -does rely on that packets fit within a T_sub interval and that every T_sub interval contains at least part -of a packet, so the actual packet rate must be close to the average packet rate. - - -Remote packets: -The remote packets drive the ring adder and are processed on arrival. The local packet with the same time stamp -is already pending in the local beamlets FIFO. If a burst of remote packet gets lost, then the node will -notice this because its local beamlets keep arriving and will overflow the local beamlets FIFO. The node will -read and discard packets from the local beamlets FIFO to make sure that the FIFO does not overflow. If only -one or a few remote packets got lost, then the node will noticethis during the time stamp alignment, but -only as soon as the next packet has arrived. This next packet will be ahead of the local packet, so the local -packets need to be flushed. The node will then read and discard packets from the local beamlets FIFO until it -can align the remote and local data. During this realignment process the next remote packet may already arrive -as well. 
Therefore the remote packet needs to be buffered, or discarded. Assume the FIFO is flushed by reading -and then discarding packets from it. The local packets and the remote packets arrive at the same rate. If the -flushing of the packets goes faster then reading them, because flushing can use all clock cycles. The flushing -can only catch up if the gaps between packets are large enough. Therefore in LOFAR 1.0 the remote packets were -discarded during the flushing. This does mean that when one packet gets lost, the flushing will also discard -the next packet and some more for as long as it takes to empty the local beamlets FIFO. An alternative would -be to keep on flushing and discarding remote packets, until the local beamlet FIFO is again ahead of the -remote packets. Typically packets will get lost rarely or in bursts. In both cases it is fine to just flush -the local beamlet FIFO until it is empty. - PN0 PN1 PN2 PN3 PN4 -t -0: L0 L1 L2 L3 L4 <-- S_sub_bf = 488 beamlets (dual pol complex) per packet - R4 R0 R1 R2 R3 - R3 R4 R0 R1 R2 +With dp_bsn_align_v2 all these modes are supported by enabling/disabling the corresponding inputs. 
+ The beamformer function has the following sub functions: - "Beamlet subband select" : Select S_sub_bf = 488 subbands per signal input -- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for S_pn = 12 signal inputs +- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for + S_pn = 12 signal inputs - "Beamlet ring adder" : if start node: - - Encode beamlet sums packet to ring + - Encode local beamlet sums packet to ring else: - - Buffer the local beamlet sums for >= N subband intervals + - Buffer the local beamlet sums for ~= N subband intervals - Decode remote beamlet sums packet from ring - Align remote beamlet sums packet and local beamlet sums packet - Add local beamlet sums to remote beamlet sums packet if transit node: - Encode beamlet sums packet to ring else: - - "Beamlet data output" : Scale and output beamlet sums -- "Beamlet statistics (BST)": Calculate BST + - "Beamlet data output" : On output node scale and output final beamlet sums +- "Beamlet statistics (BST)": Calculate BST for beamlet sums, output node has final BST ******************************************************************************* * Subband Correlator ******************************************************************************* -Crosslet transport scheme: -Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops. The remote crosslets -are correlated with the local crosslets. The remote crosslets arrive in packets from the N/2 preceding nodes. -First the local crosslets are correlated with themselves and then the local crosslets are kept in a barrel shifter, -such that they can also be correlated with the remote crosslets that arrive in the packets. -- count N_int for monitoring +With transport scheme 1 crosslets from different source nodes are combined into one packet. +Scheme 2b packs only local crosslets into a packet. 
Compared to scheme 1, scheme 2b: +- treats the local and remote crosslets independently +- has small payload and thus more packet overhead, but the load still fits on a lane +- has small payload that can be enlarged by transporting more local crosslets, to support + a subband correlator with N_crosslets > 1. +Design decision: + Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops, + because it is more flexible to have only local crosslets per packet. -Square correlator cell: -There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2 remote crosslet -packets. The local crosslets have to be correlated with the local crosslets and with each of the remote crosslet -packets. The correlation with the local crosslets is a square matrix that yields X_sq = S_pn * S_pn = 144 visibilities. Number of square correlator cells per PN: -With N = 16 PN for LBA there are N/2 = 8 remote crosslet packets. Hence together with the local crosslet visibilities -this yields X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. +There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2 +remote crosslet packets. The local crosslets have to be correlated with the local crosslets and +with each of the remote crosslet packets. The correlation with the local crosslets is a square +matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square correlator +cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other local-remote +square correlator cells the efficiency is 100 %. With N = 16 PN for LBA there are N/2 = 8 remote +crosslet packets. Hence together with the local crosslet visibilities this yields +X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. + + +Number of multipliers per crosslet: +The subband correlator needs to finish within one subband period, so within N_fft = 1024 clock +cycles. 
The X_pn = 1296 visibilities per PN can be calculated using one complex multiplier if +the complex multiplier runs at 1296 / 1024 * 200 MHz > 253 MHz. For an oversampled filterbank with +R_os <= 1.28 this requires 324 MHz, which is too much. All X_pn = 1296 can be calculated using +two complex multipliers running at > 161 MHz. However another option is to use one multiplier +per X_sq = 144 visibilities, so one complex multiplier per correlator cell and N/2 + 1 = 9 +correlator cells in parallel. The FPGA has sufficient multipliers to support this scheme and the +spare capacity of each correlator cell can be used to support a subband correlator with more +than 1 subband per integration interval, so N_crosslets > 1. + +Design decision: + Use 1 + N/2 parallel correlator cells, for the local-local visibilities and for the + local-remote visibilities for each remote source. + + +What is the crosslet packet size? +With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per +packet. A crosslet is a W_crosslet = 16 bit complex value, so 12 * 4 = 48 octets payload, so the +effective packet size is P_packet = 60 + 48 = 108 octets. The relative packet overhead for single +crosslet payloads is P_overhead / P_packet = 60 / 108 ~= 56 %. + +Maximum number of crosslets per lane: +There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields +a packet load of P_packet * f_sub * N/2 = (108 * 8b) * 195312.5 * 16 / 2 = 1.35 Gbps. The data +load of only the payload data is payload size * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 = +0.6 Gbps. Hence the small packet size causes a large packet overhead, but is still acceptable, +since it is < L_lane = 7.8125 Gbps so it fits on a single 10G lane of the ring. +Multiple local crosslets could be transported via separate packets, a lane can then fit about +7.8125 / 1.35 ~= 5 different crosslets. Packing the local crosslets into a single payload +reduces the packet overhead. 
The maximum number of crosslets per packet follows from +(P_overhead + X * 48) * 8b * f_sub * N/2 < L_lane. For N = 16 this yields X ~= +(7.8125 Gbps / (16/2) / 195312.5 / 8b - 60) / 48 = 11.8, so about 12. With 12 crosslets the payload +size is 12 * 48 = 576 and the effective packet size is P_packet = 60 + 576 = 636 octets. The relative +packet overhead for multi crosslet payloads is P_overhead / P_packet = 60 / 636 ~= 9.4%. The +packet load for multi crosslet payloads is (636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps > +L_lane = 7.8125 Gbps, so this just does not fit on a 10GbE lane, due to the still significant packet +overhead. Using X = 11 instead of 12 crosslets per packet yields a total crosslet packet load +per lane of ((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on a lane. + +Design decision: + Pack local crosslets into a single payload if N_crosslets > 1, because then the packet overhead + is much reduced to support transporting more crosslets per lane (11 instead of 5). + + +Maximum number of crosslets per correlator cell: +A X_pn correlator cell can correlate N_fft / X_sq = 1024 / 144 = 7 different crosslet frequencies. +With N = 16 for LBA, there need to be N/2 + 1 = 9 of these X_pn correlator cells in parallel. One +X_pn correlates the local-local crosslets and the other N/2 X_pn correlates the local-remote +crosslets. These 9 X_pn in parallel can correlate up to 7 crosslets. The link can transport +a maximum of 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO capacity of 1 +10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets. The crosslet data rate +on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a utilization of 4.95 / +7.8125 = 63 %. Another set of 9 X_pn could be used to correlate the remaining 11 - 7 = 4 crosslets +that can be transported via the ring. 
However, if more than N_crosslets = 7 crosslets need to be +correlated in parallel per integration, then it is easier to allocate an extra lane and to +instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in total. + +One X_pn takes one complex multiplier. For N_crosslets = 1 crosslet per integration interval using +N/2+1 = 9 X_pn uses only 144 / 1024 = 14% of the processing resources. However this is acceptable +because: +- the FPGA has sufficient multipliers +- it provides a clear design +- the spare capacity can be used to process more crosslets per integration interval + +Design decision: + Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or up to 7 + crosslets in parallel, per integration interval. + + + +Send more than one time slot per packet? +To reduce the relative packet overhead for single crosslet XC it is an option to put multiple +time slots per payload. Design decision: This is considered too complicated. + + +What if a packet gets lost? +The local crosslets cannot get lost, but remote packets may get lost. The BSN aligner will replace +lost remote packets with filler packets that are flagged. The crosslets in the filler packets +contain zero data, so in the correlator they do not contribute to the visibilities. Each correlator +cell has to count the number of valid and flagged crosslets per integration interval. The number +of valid crosslets N_valid can be used to weight the visibility relative to the expected number of +N_int crosslets. The number of flagged crosslets N_flagged is used for monitoring. For every +integration interval N_valid + N_flagged = N_int, by design of the BSN aligner. + -Crosslet period: -The subband correlator needs to finished within one subband period, so T_xc < T_sub. For the critically sampled -filterbank the subband period is N_fft = 1024 sample periods. 
The X_pn = 1296 visibililies per PN can be -caluculated using one complex multiplier if the multiplier runs at 1296 / 1024 * 200 M > 253 MHz. For an oversampled -filterbank with R_os <= 1.25 this requires 1.25 * 253 = 317 MHz, which may be too much. Time in diagrams: - equal time for all PN in same row and in same relative column @@ -441,81 +617,11 @@ N_int-1: <-- Dump and restart XST: - not calculated because conj() -What is the crosslet packet size? -With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per packet. A crosslet is -a W_crosslet = 16 bit complex value, so 12 * 4 = 48 octets payload, so the effective packet size is 40 + 48 = 88 octets. -The relative packet overhead for single crosslet payloads is 40 / 88 = 45 %. - -There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields a packet load of -packet size * f_sub * N/2 = (88 * 8b) * 195312.5 * 16 / 2 = 1.1 Gbps. The data load of only the payload data is -payload size * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 = 0.6 Gbps. Hence the small packet size causes a large -packet overhead, but is still acceptable, since it fits on a single 10G link of the ring. - -Calculate one or multiple crosslets: -With small payloads the 10G link could fit about 10/1.1 ~= 8 different crosslets. With larger payloads the 10G link -could fit about 10 / 0.6 = 16 crosslets. The advantage of using small payloads is that adding more crosslets can be done -by instantiating the same single crosslets XC multiple times. However the small packets do have to travel sequentially -via the same 10G link, so there needs to be a multiplexer after that the local ETH frames have been made. The advantage of -using larger payloads is that they can be made by putting the extra crosslets in the same payload. With 16 crosslets -the payload size is 16 * 48 = 768 and the effective packet size is 40 + 768 = 808 octets. 
The relative packet overhead for
-multi crosslet payloads is 40 / 808 ~= 5 %. The packet load for multi crosslet payloads is (808 * 8b) * 195312.5 * 16 / 2 =
-10.1 Gbps, so this will just not fit on a 10GbE link, but 15 crosslets would.
-
-At 200 MHz for the critically sampled subbands, an X_pn correlator cell can correlate N_fft / X_sq = 1024 / 144 = 7
-different crosslet frequencies. With N = 16 for LBA there need to be N/2 + 1 = 9 of these X_pn correlator cells in
-parallel. One X_pn correlates the local-local crosslets and the other N/2 X_pn correlate the local-remote crosslets.
-These 9 X_pn in parallel can correlate up to 7 crosslets. The link can transport 15 crosslets, so 18 X_pn in parallel
-could correlate 14 different crosslets to make better use of the link capacity.
-
-One X_pn takes one complex multiplier. For one crosslet using N/2+1 = 9 X_pn is a waste of resources, but still
-acceptable and it provides a clear design.
-
-
-Send more than one time slot per packet?
-To reduce the relative packet overhead for single crosslet XC it is an option to put multiple time slots per payload.
-This is considered too complicated.
-
-       PN0   PN1   PN2   PN3   PN4
-t
-0:     L00   L11   L22   L33   L44   <-- For example two time slots per packet
-       R44   R00   R11   R22   R33
-       R33   R44   R00   R11   R22
-2:
-What if a node fails?
-The next N/2 nodes will then miss packets. The order of the packets is not affected, because on each node it will
-be the last one or more packets that are missed. There will be no correlations for the missed packets, but the
-correlation should continue if the node starts again in the next time slot. A packet count per packet source at each
-node will reveal missed packets and thus also the number of integrations that happened in the final visibilities.
-If no packets are missed then the packet count is 195312.5 per integration interval on every PN for every packet
-source PN.
-
-       PN0   PN1   PN2   PN3   PN4
-t
-0:     L0    .     L2    L3    L4    <-- PN1 fails, so next N/2 nodes will miss packets
-       R4    .     .     R2    R3
-       R3    .     .     .     R2
-       00    .     22    33    44
-       04    .     .     32    43
-       03    .     .     .     42
-What if a packet gets lost?
-If a packet gets lost then it can cause a gap in the packet order, so the next packet must not be mistaken for
-the lost packet. Therefore the packets must have a time slot number and a source number, such that the XST in
-each node will use it for the correct visibilities.
-       PN0   PN1   PN2   PN3   PN4
-t
-0:     L0    L1    L2    L3    L4
-       R4    .     R1    R2    R3    <-- L0 from PN0 gets lost at PN1
-       R3    R4    .     R1    R2
-
-Packet order is guaranteed?
-At the start of every time slot the local L# packet is sent first. After that each node passes on the packets that
-it receives. Therefore the packets arrive in order, with the packet from the closest node first and from the furthest
-node last. If a packet gets lost then there will be a gap, but the order is still preserved.
 What if T_sq > T_hop latency on ring?
 What if T_sub > N/2 * T_hop latency on ring?
diff --git a/applications/lofar2/doc/prestudy/station2_to_do_erko.txt b/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
index ee6be7b14d..b7862111df 100755
--- a/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
+++ b/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
@@ -182,15 +182,16 @@ git remote remove <remote name> # remove a remote repo
 *******************************************************************************
 Open issues:
 - Central HDL_IO_FILE_SIM_DIR = build/sim --> Project local sim dir
-- avs_eth_coe.vhd per tool version? Because copying avs_eth_coe_<buildset>_hw.tcl to $HDL_BUILD_DIR copies the
-  last <buildset>, using more than one buildset at a time gives conflicts.
+- avs_eth_coe.vhd per tool version? Because copying avs_eth_coe_<buildset>_hw.tcl to $HDL_BUILD_DIR
+  copies the last <buildset>, using more than one buildset at a time gives conflicts.
 *******************************************************************************
 * To do:
 *******************************************************************************
-- Check that the Expert users (MB, SJW, MN), Maintainers (HM) and Local users are happy with the design decisions
+- Check that the Expert users (MB, SJW, MN), Maintainers (HM) and Local users are happy with the
+  design decisions
 - H6 M&C loads section
 - H3 Functions mapping
 - H3/4 Timing (1s default, PPS, event message)
@@ -225,7 +226,7 @@ Open issues:
 - Update RadioHDL docs
 - Write RadioHDL article
 - Write HDL RL=0 article - desp_hdl_design_article.txt
-
+- XST : SNR = 1 per visibility for 10000 samples, brightest source log 19.5 --> 4.5 dB --> T_int = 1 s is ok.
--
GitLab