From deba712170310be1f7135acf362898e10309bcd1 Mon Sep 17 00:00:00 2001
From: Eric Kooistra <kooistra@astron.nl>
Date: Fri, 22 Nov 2019 15:37:33 +0100
Subject: [PATCH] Updated BF and XC part of station2_sdp_ring.txt.

---
 .../lofar2/doc/prestudy/station2_sdp_dsp.txt  |   5 +-
 .../prestudy/station2_sdp_hdl_components.txt  |  10 +-
 .../lofar2/doc/prestudy/station2_sdp_ring.txt | 838 ++++++++++--------
 .../doc/prestudy/station2_to_do_erko.txt      |   9 +-
 4 files changed, 489 insertions(+), 373 deletions(-)
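
Reviewer note (not part of the patch): the beamlet lane budget and the DP/ETH
overhead quoted in the station2_sdp_ring.txt hunk can be cross-checked with the
short sketch below. The constant and function names are illustrative only, the
values are copied from the text, and the single-packet check assumes that all
beamlets of one subband period travel in one packet.

```python
# Cross-check of the beamlet lane budget quoted in station2_sdp_ring.txt.
# All names are illustrative; the values are taken from the text.
N_POL = 2             # polarizations
S_SUB_BF = 488        # beamlets per full band station beam
F_SUB = 195312.5      # subband rate in Hz (200 MHz / 1024)
N_COMPLEX = 2         # real and imaginary part
W_BEAMLET_SUM = 18    # bits per beamlet sum component
R_OS_MAX = 1.28       # maximum oversampling factor
LINK_RATE = 10e9      # 10GbE lane rate in bit/s

ETH_OVERHEAD = 8 + 14 + 2 + 4 + 12   # preamble, header, pad, CRC, gap (octets)
DP_OVERHEAD = 4 + 8 + 4 + 4          # CHAN, Sync & BSN, ERR, CRC (octets)
P_OVERHEAD = ETH_OVERHEAD + DP_OVERHEAD


def lane_capacity(r_os=R_OS_MAX):
    """Lane capacity in bit/s that critically sampled data may use."""
    return LINK_RATE / r_os


def beamlet_rate():
    """Payload rate in bit/s of one critically sampled station beam."""
    return N_POL * S_SUB_BF * F_SUB * N_COMPLEX * W_BEAMLET_SUM


def margin():
    """Fraction of the restricted lane capacity left for packet overhead."""
    return 1.0 - beamlet_rate() / lane_capacity()


def beamlet_packet_octets():
    """Packet size in octets, assuming one packet per subband period."""
    payload_bits = N_POL * S_SUB_BF * N_COMPLEX * W_BEAMLET_SUM
    return payload_bits // 8 + P_OVERHEAD


def max_packet_octets(r_os=R_OS_MAX):
    """Largest packet in octets that fits on the lane per subband period."""
    return lane_capacity(r_os) / F_SUB / 8


if __name__ == "__main__":
    print("beamlet rate  : %.4f Gbps" % (beamlet_rate() / 1e9))
    print("lane capacity : %.4f Gbps" % (lane_capacity() / 1e9))
    print("margin        : %.1f %%" % (100 * margin()))
    print("packet size   : %d of max %d octets"
          % (beamlet_packet_octets(), max_packet_octets()))
```

Under these assumptions the 60 octets of DP/ETH overhead fit within the
remaining margin on one lane.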

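Reviewer note (not part of the patch): the maximum packet sizes and the
store-and-forward ring latency discussed in the station2_sdp_ring.txt hunk can
be reproduced with the sketch below. The function names are illustrative, and
adding the 0.2 us hop latency to the transmit time of every hop is an assumed
simple model, not something the text prescribes.

```python
# Cross-check of the ring packet sizes and latency figures in the text.
# Names are illustrative; the latency model is an assumption.
LINK_RATE = 10e9      # 10GbE lane rate in bit/s
F_SUB = 195312.5      # packet rate in Hz, one packet per subband period


def max_packet_octets(r_os=1.0, n_packets=1):
    """Largest packet in octets when n_packets share a lane per period."""
    bits_per_period = LINK_RATE / (F_SUB * r_os)
    return bits_per_period / 8 / n_packets


def transmit_time_us(octets):
    """Time in us to clock one packet of the given size onto the lane."""
    return octets * 8 / LINK_RATE * 1e6


def ring_latency_us(octets, n_nodes, hop_latency_us=0.2):
    """Store-and-forward latency in us for a packet that passes n_nodes - 1 hops."""
    hops = n_nodes - 1
    return hops * (transmit_time_us(octets) + hop_latency_us)


def align_fifo_octets(octets, n_nodes):
    """FIFO depth in octets to align local data with the farthest remote data."""
    return (n_nodes - 1) * octets


if __name__ == "__main__":
    for r_os, n_packets in [(1.0, 1), (1.25, 1), (1.25, 16), (1.25, 8)]:
        print("R_os = %.2f, %2d packets: %6.0f octets"
              % (r_os, n_packets, max_packet_octets(r_os, n_packets)))
    print("1500 octets: %.2f us" % transmit_time_us(1500))
    print("6400 octets: %.2f us" % transmit_time_us(6400))
    print("N = 16 ring, 6400 octets: %.1f us" % ring_latency_us(6400, 16))
```

For packets larger than about 300 octets the transmit time term exceeds the
0.2 us hop latency, which matches the statement that store-and-forward
dominates the ring latency.
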
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
index 993d1734ff..31bf474f7e 100644
--- a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
@@ -44,11 +44,14 @@ M&C:
     
     The (x+y) could be implemented as first (x+y) and then *w, or as first weight and then add. 
 
+
 *******************************************************************************
 * Subband correlator
 *******************************************************************************
 
-
+First the local crosslets are correlated with themselves and then
+the local crosslets are kept in a barrel shifter, such that they can also be correlated with the
+remote crosslets that arrive in the packets.
 
 
 *******************************************************************************
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
index c82daa7080..cc41dda277 100644
--- a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
@@ -44,7 +44,7 @@
 - RSP RAD frame:
   . uses: FSI, FSN, DATA, CRC.
   . The FSN is 16 bit but the MSbit is used for the sync. The other 15 bits count blocks.
-  . After Rx frame the FSI is stripped and the CRC is replace by a BRC.
+  . After Rx frame the FSI is stripped and the CRC is replaced by a boolean check (BRC).
 
 - CRC Error checking:
   The CRC is a 32 bit number, so the chance that the CRC results in a false positive is 1/2**32 ~= 2.3e-10 or 1
@@ -123,7 +123,7 @@ to ensure that all inputs have the same 64 bit sync and BSN.
 
 
 *******************************************************************************
-* BSN aligner 
+* BSN aligner dp_bsn_align_v2
 *******************************************************************************
 
 Assumptions:
@@ -441,6 +441,12 @@ Design options:
     - flush per packet or flush until empty?
     - flush per input per input or flush all inputs?
     - flush by reading, or by reset or by moving a Rd pointer
+      A FIFO can be flushed by resetting it, but this requires careful control to ensure that the reset is
+      noticed in both clock domains, and that the reset is applied in between input packets to avoid that
+      only a tail of a packet gets into a FIFO. Therefore in LOFAR 1.0 and APERTIF a FIFO is flushed by
+      reading the packets from it until it is empty. This scheme also allows flushing per packet. The
+      disadvantage of reading the packets and then discarding them is that it takes as long as reading
+      them at full speed.
     - Use packet count instead of FIFO full indicator
     - can we do without flushing the FIFO? Not if we need to realign.
     - If multiple packets on a remote input get lost, then the other inputs fill up if there is no timeout. Flush
diff --git a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
index b484c32f0e..90e1fe9030 100644
--- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
@@ -2,60 +2,118 @@ Detailed design: RING
 
 
 *******************************************************************************
-* Data format
+* Data rate
 *******************************************************************************
 
 Support for oversampled subband filterbank
-The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are 32/28 = 1.142, 
-32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28.
-
-Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled subbands it will run at
-R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz. In this way if the processing fits for the
-critically sampled subbands, then it will also fit for the oversampled subbands.
-
-The IO data rate on the ring increases with the oversampling factor R_os.  For oversampled data the ring 10GbE has
-the full 10 Gbps capacity and for critically sampled data the effective ring capacity becomes 10G / R_os = 
-10G / 1.28 = 7.8125 Gbps. The aim is to be able to replace the critically sampled filterbank by an oversampled
-filterbank without having to change other parts in the design. Therefore assume that the ring capacity for the
-critically sampled data is restricted to 7.8125 Gbps. The alternative to use full ring capacity for critically 
-sampled data and then support less (S_sub_bf / R_os = 488 / 1.28) beamlets for oversampled data is not compliant
-with the requirement of S_sub_bf = 488.
+The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are
+32/28 = 1.142, 32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28.
+
+Processing capacity per subband period:
+Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled
+subbands it will run at R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz.
+This means that the processing has N_fft = 1024 clock cycles available per subband period T_sub,
+independent of R_os. In this way if the processing for the critically sampled subbands fits
+within N_clk = N_fft = 1024 clock cycles, then it will also fit for the oversampled subbands.
+
+IO capacity per 10GbE lane:
+The IO data rate on the ring increases with the oversampling factor R_os. For oversampled data
+the ring 10GbE has the full 10 Gbps capacity and for critically sampled data the effective
+ring capacity per lane becomes L_lane = 10G / R_os = 10G / 1.28 = 7.8125 Gbps. The aim is to
+be able to replace the critically sampled filterbank by an oversampled filterbank without
+having to change other parts in the design. Therefore assume that the ring capacity for the
+critically sampled data is restricted to L_lane < 7.8125 Gbps.
+
+Note:
+The alternative to use full ring capacity for critically sampled data and then support fewer
+(S_sub_bf / R_os = 488 / 1.28 = 381, so about 22 % fewer) beamlets for oversampled data is not
+compliant with the requirement of S_sub_bf = 488.
 
 Design descision: Support S_sub_bf = 488 also for maximum R_os = 1.28.
 
 
 W_beamlet_sum
-LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will only support 8 bit.
-Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit beamlet mode, which is sufficient to
-detect overflow. Using W_beamlet_sum = 18 bit also fits the input data width of the FPGA hard core multipliers in
-the BST. Given that the signal input level is 4 bit the beamformer could round 2 LSbit to effectively achieve
-20 bit dynamic range, even for S = 1 signal input. However the same effect can also be achieved by reducing the
-beamlet weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically sampled 
-beamlet data and the oversampled beamlet data, to avoid differences in the design. 
-The beamlet sum that is transported across the ring needs to fit on a 10GbE link. With S_sub_bf = 488 and R_os <=
-1.28 the data rate for one full band station beam is N_pol * S_sub_bf * f_sub * R_os * N_complex * W_beamlet_sum
-= 2 * 488 * 195312.5 * 1.28 * 2 * 18 = 8.784 Gbps. This leaves about 13.8 % margin for packet overhead, which is
-sufficient. Using W_beamlet_sum = 18 bit fits the input data width of the FPGA hard core multipliers and also 
-provides sufficent dynamic range to scale the final beamlet sum to W_beamlet = 8 bit for output.
-
-Design descision: W_beamlet_sum = 18 bit for both critically sampled beamlet and oversampled beamlets
+LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will
+only support 8 bit. Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit
+beamlet mode, which is sufficient to detect overflow. Using W_beamlet_sum = 18 bit also fits the
+input data width of the FPGA hard core multipliers in the BST. Given that the SDP signal input
+level is 4 bit the beamformer could round 2 LSbit to effectively achieve 20 bit dynamic range,
+even for S = 1 signal input. However the same effect can also be achieved by reducing the beamlet
+weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically
+sampled beamlet data and the oversampled beamlet data, to avoid differences in the design. The
+beamlet sum that is transported across the ring needs to fit on a 10GbE lane. With S_sub_bf = 488
+the data rate for one full band station beam is N_pol * S_sub_bf * f_sub * N_complex *
+W_beamlet_sum = 2 * 488 * 195312.5 * 2 * 18 = 6.8625 Gbps. Using L_lane = 7.8125 Gbps this leaves
+about 1 - 6.8625 / 7.8125 = 12 % margin for packet overhead, which is sufficient.
+
+
+Design decision:
+  Use W_beamlet_sum = 18 bit for both critically sampled beamlets and oversampled beamlets.
+  Using W_beamlet_sum = 18 bit fits on one 10GbE lane of the ring, fits the input data width
+  of the FPGA hard core multipliers and also provides sufficient dynamic range to scale the final
+  beamlet sum to W_beamlet = 8 bit for output.
+
 
 
 *******************************************************************************
-* Ring function
+* Ring links:
 *******************************************************************************
 
-Ring transceiver medium access (MAC):
-Use Ethernet per transceiver link.The Ethernet MAC provides link establishment, so it uses a full duplex transceiver. The
-Ethernet packet header contains destination MAC address, source MAC address and Ethernet type. The Ethernet packet tail
-contains a CRC. The CRC provides data error detection. No need to use UDP/IP and ARP, because the links in the ring are
-point to point and will not be used in a network. The Ethernet fields can be used as:
- - Destination MAC = destination PN index
- - Source MAC = source PN index
- - Ethernet type = packet type
+OSI 1 Physical layer: Transceivers
+
+OSI 2 Data link layer:
+Use Ethernet per transceiver link. The Ethernet MAC provides link establishment, so it uses a full
+duplex transceiver. The Ethernet packet header contains destination MAC address, source MAC
+address and Ethernet type. The Ethernet packet tail contains a CRC. The CRC provides data error
+detection. There is no need to use UDP/IP and ARP, because the links in the ring are point to
+point and will not be used in a network. The Ethernet fields can be used as:
+
+- Destination MAC = destination PN index
+- Source MAC = source PN index
+- Ethernet type = packet type
+
 Design decision: Use Ethernet for the ring transceiver links
 
-Ring application packet types:
+
+Use 10GbE or 40GbE:
+From the low-latency Ethernet core user guides it follows that the Ethernet core with statistics
+registers uses:
+
+- 10GbE core :  4300 FF,  4 M9K
+- 40GbE core : 21200 FF, 13 M20K
+
+The synthesis fitter results from Apertif BF and XC show that the tech_eth_10g takes about 5500 FF
+and 4 (BF) or 7 (XC) M9K. The BF MAC has no statistics, the XC MAC does have statistics.
+Hence the 40GbE core is about a factor 4 larger than the 10GbE core, so from a resource usage point
+of view it does not matter whether we use 4 x 10GbE or 1 x 40GbE. The advantage of 40GbE is
+that it can fit data rates > 10Gbps per data type stream. The advantage of using 10GbE is that we
+can use one link per data type stream and thereby avoid having to multiplex different data streams
+onto the same 40GbE link. However some multiplexing of local packets and remote transit packets can
+also be needed. UniBoard2 has been tested with 10GbE but not yet with 40GbE.
+The Arria10 on UniBoard2 has 1708800 FF so 1708800 / 182400 = 9.4 times more than the Stratix IV
+on UniBoard1. On UniBoard2 one 10GbE interface uses maximum about 5500 / 1708800 = 0.32 % of the
+FF and maximum about 7 / 2713 = 0.26 % of the block RAM. In total there will be 4 x 10GbE for the
+intra board ring, 4 x 10GbE for the inter board ring and 1 x 10GbE for external IO, so these will
+take about 3% of the FF and block RAM resources.
+The packet rate is f_sub = 195312.5 Hz. At 10GbE this means that the maximum packet size is
+10e9 / 195312.5 / 8 = 6400 octets. For oversampled subbands the maximum packet size drops to about
+6400 / 1.25 = 5120 octets. If the minimum packet size is e.g. 4000 octets, then at 10GbE this
+means that the link cannot be fully used, whereas at 40GbE multiple packets will still fit. The
+maximum packet size for 10GbE also depends on the number of packets on the ring:
+. With one packet on the ring the maximum packet size for 10GbE is 5120 (R_os = 1.25) octets,
+. With N = 16 nodes and all nodes sending to the same end node the maximum packet size is 5120/16
+  = 320 (R_os = 1.25) octets,
+. If the packet only needs to travel N/2 nodes then the maximum packet size is 5120/8 = 640
+  (R_os = 1.25) octets.
+
+Design decision: Assume the ring will use 4 x 10GbE, because it is known technology and suitable.
+
+
+Internally in the FPGA the 10GbE data on the ring interface is available as 64 bit data at 
+156.25 MHz (64 * 156.25M = 10G). 
+
+
+Ring application Ethernet packet types:
 The ring is used for the following application packet types:
 
 - 0x10FB for beamlets,
@@ -63,136 +121,235 @@ The ring is used for the following application packet types:
 - 0x10FD for subband offload,
 - 0x10FE for transient buffer read out
 
-The packet type information can be transported via the Ethernet type field or via an UDP port number. If each link
-is only used for one kind of packet type, then the packet type is only used for information, because the PN
-already knows the packet type. The packet type value is based on packet types that were defined in RSP, where
-0x10FA was used to identify M&C data (0x10FA ~= LOFAR) and the other type values just increment the 0x10FA value.
+The packet type information can be transported via the Ethernet type field or via a UDP port
+number. If each lane is only used for one kind of packet type, then the packet type is only used
+for information, because the PN already knows the packet type. The packet type value is based
+on packet types that were defined in RSP, where 0x10FA was used to identify M&C data (0x10FA ~=
+LOFAR) and the other type values just increment the 0x10FA value.
+
 Design decision: Transport application packet type via Ethernet type field for information
 
 
-Use UDP/IP/ETH or only ETH on the ring:
-We already have a UDP offload component that supports UDP/IP/ETH, but a similar component that only supports ETH is
-easily derived from it. With an UDP the LOFAR packet type information can be transported via the UDP port field.
-Using UDP/IP makes it easier to send the data to a PC for monitoring purposes, however it is also possible to sniff
-raw Ethernet packets on a PC. Using a PC to verify the ring allows capturing large amounts of data. On an FPGA we
-can use a data buffer to sniff the packets, but only a few.
-The extra overhead of UDP = 8 octets and IP = 20, so 28 octets in total. The disadvantage of using UDP/IP is that
-it adds some extra traffic overhead and uses some extra logic resources, but that could be acceptable. The
-disadvantage of verifying the ring using a PC are:
+
+OSI 3 Network layer: Use ring
+
+Wormhole routing (or cut-through routing) or store-and-forward routing:
+With wormhole routing a received packet or a received and modified packet is already
+transmitted, while the tail of the packet is still being received. The advantage of wormhole
+routing is that it minimizes the latency along the ring and therefore also local buffering to
+align between local and remote data. The disadvantage of wormhole routing is that a CRC error
+on the received packet needs to be propagated by forcing the CRC of the transmitted packet to be
+wrong. This implies that all subsequent hops will show this CRC error. For link diagnosis this
+is confusing, because the subsequent links did not cause the CRC error. With store-and-forward
+routing a packet is first received entirely before it is passed on for transmit. This allows
+discarding a received packet with a CRC error, but does increase the latency on the ring. Packets
+with a CRC error cannot be allowed to enter the processing in the node, because any bit in the
+packet may be corrupted, especially in the packet header, so no meaningful processing is
+possible.
+
+Design decision:
+  For LOFAR 2.0 choose to use store-and-forward, because it allows discarding packets with CRC
+  errors when they occur and because there is sufficient internal block RAM to buffer the local
+  data for the worst case ring latency.
+
+
+Only accept correct packets:
+Discard all packets that have a CRC error. This also prevents packets of wrong length from entering
+the internal processing. The Ethernet CRC is 32 bit, so it is very unlikely that a packet with
+errors still has a correct CRC. With wormhole routing it was necessary to limit or extend a packet
+to a known fixed length, because also packets with CRC error are passed on. With store-and-forward
+routing the CRC provides sufficient protection to ensure that only correct packets enter the
+application.
+
+
+Ring latency:
+The latency of 1 hop is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at
+10Gbps is about 1.2 us and a jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for
+packets >~ 300 octets the ring latency is dominated by the store-and-forward routing at each node.
+The 10GbE Ethernet MAC uses 64 bit data. At 200 MHz this can achieve 64 * 0.2 = 12.8 Gbps. Hence
+if the processing operates without data valid gaps, then the Ethernet transmit will not run empty
+during a payload. Therefore it is not necessary to use a fill FIFO, which would add to the ring
+latency. For a packet that travels the entire ring the latency is then about (N-1) * T_sub and
+the corresponding FIFO depth to align the local data with this remote data is (N-1) * packet size.
+
+
+
+OSI 4 Transport layer: Use UDP/IP/ETH or only ETH on the ring:
+We already have a UDP offload component that supports DP/UDP/IP/ETH, but a similar component that
+only supports DP/ETH is easily derived from it. With UDP the LOFAR packet type information can
+be transported via the UDP port field. Using UDP/IP makes it easier to send the data to a PC for
+monitoring purposes, however it is also possible to sniff raw Ethernet packets on a PC. Using a
+PC to verify the ring allows capturing large amounts of data. On an FPGA we can use a data buffer
+to sniff the packets, but only a few. The extra overhead is UDP = 8 octets and IP = 20 octets, so
+28 octets in total. The disadvantage of using UDP/IP is that it adds some extra traffic overhead
+and uses some extra logic resources, but that could be acceptable. The disadvantages of verifying
+the ring using a PC are:
+
 - between FPGAs on the same UniBoard the ring can only be observed on the FPGA
-- the ring will only connect FPGAs in the application, so using a PC is a side track that as such may cause extra
-  work.
-Using UDP/IP does not make it possible to replace the ring by a switch without modifications, so changing from a
-ring based design to a switch based design will still imply a redesign of the data transport scheme.
-Design decision: Use raw Ethernet and verification on FPGA, because that fits the ring (especially between FPGAs
-                 on UniBoard2) and avoids the extra overhead of UDP/IP.
-
-Ring application header:
-The packet payload needs to have an application header to carry the timestamp and a stream identifier. This
-information can be tranported via the DP packet header which has a BSN field and a channel field. The BSN is the
-timestamp. The channel field can carry the source PN index and destination PN index. These PN indices are also
-available in the ETH source and destination MAC addresses of ETH encoded packets, but they also need to be
-available in ETH decoded packets. In ETH encoded packets the destination MAC address allow direct pass on of
-transit packets on the ring, without having to ETH decode them. In ETH decoded packets the BSN and channel fields
-can be passed along inside the encoded DP packet or in parallel with the decoded DP packet application data. The
-channel information can be used to process the remote packets in parallel e.g. per source PN index.
-
-
-
-What is the ETH packet overhead?
-The ETH packet overhead consists of:
-. Add  8 octets (c_network_eth_preamble_len) for Ethernet preamble
-. Add 14 octets for the ETH header that contains destination MAC (6), source MAC (6) and Ethernet type (2)
-. Add  2 octets to pad the ETH header to align to 8 byte word boundary
-. Add  4 octets for CRC
-. Add 12 octets (c_network_eth_gap_len) for Ethernet gap size between packets
-  = 8 + 14 + 2 + 4 + 12 = 40 octets
+- the ring will only connect FPGAs in the application, so using a PC is a side track that as such
+  may cause extra work.
+  
+Using UDP/IP does not make it possible to replace the ring by a switch without modifications, so
+changing from a ring based design to a switch based design will still imply a redesign of the
+data transport scheme.
+
+Design decision:
+  Use raw ETH and verification on FPGA, because that fits the ring (especially between FPGAs on
+  UniBoard2) and avoids the extra overhead of UDP/IP.
+
+
+Ring application header DP/ETH:
+The packet payload needs to have an application header to carry the timestamp and a stream
+identifier. This information can be transported via the DP packet header which has a BSN field and
+a channel field. The BSN is the timestamp. The channel field can carry the source PN index and
+destination PN index. These PN indices are also available in the ETH source and destination MAC
+addresses of ETH encoded packets, but they also need to be available in ETH decoded packets. In
+ETH encoded packets the destination MAC address allows direct pass on of transit packets on the
+ring, without having to ETH decode them. In ETH decoded packets the BSN and channel fields can be
+passed along inside the encoded DP packet or in parallel with the decoded DP packet application
+data. The channel information can be used to process the remote packets in parallel e.g. per
+source PN index. The channel information can also provide flagging information, to e.g. identify
+filler packets.
+
+Design decision:
+  Use DP/ETH. Together the DP CRC and ETH CRC ensure that for the lifetime of LOFAR2.0 packets
+  with errors will not be accepted as correct (no false positives). Use a bit in the channel
+  field to indicate filler packets.
+
+
+What is the DP/ETH packet overhead?
 
+- The ETH packet overhead consists of:
+  . Add  8 octets (c_network_eth_preamble_len) for Ethernet preamble
+  . Add 14 octets for the ETH header that contains destination MAC (6), source MAC (6) and
+    Ethernet type (2)
+  . Add  2 octets to pad the ETH header to align to 8 byte word boundary
+  . Add  4 octets for CRC
+  . Add 12 octets (c_network_eth_gap_len) for Ethernet gap size between packets
+    = 8 + 14 + 2 + 4 + 12 = 40 octets
+
+- The DP packet overhead consists of (dp_packet_enc_crc / dp_packet_dec_crc):
+  . Add 4 octets for CHAN (32b)
+  . Add 8 octets for Sync & BSN (64b)
+  . Add 4 octets for ERR (32b)
+  . Add 4 octets for CRC (32b)
+    = 4 + 8 + 4 + 4 = 20 octets
+
+Design decision: The DP/ETH packet overhead is P_overhead = 60 octets.
+
+
+Use one packet type per ring lane:
+This avoids having to multiplex different packet types onto a single lane. Still the Ethernet type
+can be used to fill in the packet type to more easily identify data on different lanes of the ring.
 
 How many transceivers are needed for the ring?
-There are four data types beamlets, crosslets, subband offload and transient buffer read out. The data loads are:
-- 488 beamlets (R_os = 1 --> W_beamlet_sum = 24 bit, R_os = 1.25 --> W_beamlet_sum = 19.2 ~= 20 bit)
+The ring uses 4 of the 12 available transceivers, to match the QSFP cable link that is needed to
+connect the ring between UniBoard2.
+There are four data types beamlets, crosslets, subband offload and transient buffer read out. The
+data loads are:
+- 488 beamlets (R_os = 1 --> W_beamlet_sum = 18 bit, R_os = 1.28)
 - ~10 crosslets (R_os = 1 --> 15 crosslets, R_os = 1.25 --> 12 crosslets)
-- ~    subbands (R_os = 1
-- 
+-     subbands (R_os = 1
+- << 10Gbps transient buffer data
+
+
+Link monitoring:
+The link should be monitored during normal operation, to avoid the need to define and control a
+test packet (e.g. like ping). The link monitoring should directly identify the source of an error
+(e.g. tx node, link, rx node).
+Design decision: Use DP/ETH packets to monitor the link quality.
+
 
-Choose to transport one data type packet per 10GbE link direction. 
+*******************************************************************************
+* Ring usage:
+*******************************************************************************
+
+OSI 5 Session layer:
+OSI 6 Presentation layer:
+OSI 7 Application layer:
 
-The ring can be used in both directions. The forward direction is e.g. from PN0 to 15, the backward direction is e.g.
-from PN 15 to 0. The ring uses 4 of the 12 available transceivers, to match the QSFP cable link that is needed to connect
-the ring between UniBoard2.
 
 The ring function has the following sub functions:
 - Receive packets from ring (and remove CRC field)
 - Discard incorrect packets (based on CRC)
-- Pass on transit packets (Destination MAC > PN index for forward ring, MAC < PN index for backward ring)
+- Pass on transit packets (Destination MAC > PN index for forward ring, MAC < PN index for backward
+  ring)
 - Decode packets (get packet from ring for internal use)
 - Encode packets (put internal packet onto ring)
 - Multiplex local and transit packets
 - Transmit packets onto ring
+- Monitor Rx and Tx packets
+- Align packets for processing (use filler data on inputs with lost packets)
 
 
-Use 10GbE or 40GbE:
-From the low-latency Ethernet core user guides it follows that the Ethernet core with statistics registers use:
- 10GbE core :  4300 FF,  4 M9K
- 40GbE core : 21200 FF, 13 M20K
-The synhesis fitter results from Apertif BF and XC show that the tech_eth_10g takes about 5500 FF and 4 (BF) or 7 (XC) M9K.
-The BF MAC has no statistics, the XC MAC does have statistics.
-Hence the 40GbE core is about a factor 4 larger than the 10GbE core, so from a resource usage point of view it is does not matter
-whether we use 4  x 10GbE  or 1 x 40GbE. The advantage of 40GbE is that it can fit data rates > 10Gbps per data type stream. The
-advantage of using 10GbE is that we can use one link per data type stream and thereby avoid having to multiplex different data 
-streams onto the same 40GbE link. However some multiplexing of local packets and remote transit packets can also be needed. 
-UniBoard2 has been tested with 10GbE but not yet with 40GbE.
-The Arria10 on UniBoard2 has 1708800 FF so 1708800 / 182400 = 9.3 times more than the Stratix IV on UniBoard1. On UniBoard2 one
-10GbE interface uses maximum about 5500 / 1708800 = 0.32 % of the FF and maximum about 7 / 2713 = 0.25% of the block RAM.
-In total there will be 4 x 10GbE for the intra board ring, 4 x 10GbE for the inter board ring and 1 x 10GbE for external IO, so
-these will take about 3% of the FF and block RAM resources.
-The packet rate is f_sub = 195312.5 Hz. At 10GbE this means that the maximum packet size is 10e9/195312.5 = 6400 octets. For
-oversampled subbands the maximum packet size drops to about 6400 / 1.25 = 5120 octets. If the minimum packet size is e.g. 4000
-octets, then at 10GbE this means that the link cannot be fully used, whereas at 40GbE multiple packets will still fit. The 
-maximum packet size for 10GbE also depends on the number of packets on the ring:
-. With one packet on the ring the maximum packet size for 10GbE is 5120 (R_os = 1.25) octets,
-. With N = 16 nodes and all nodes sending to the same end node the maximum packet size is 5120/16 = 320 (R_os = 1.25) octets,
-. If the packet only needs to travel N/2 nodes then the maximum packet size is 5120/8 = 640 (R_os = 1.25) octets.
-Design descision: Assume the ring will use 4 x 10GbE, because it is known technology and suitable.
+Ring access schemes:
 
+- 1) start node sends packet to end node, intermediate nodes modify the packet.
+- 2a) each node starts sending its packets to an end node, intermediate nodes pass on the packet
+- 2b) each node starts sending its packets to an end node, intermediate nodes pass on the packet
+      and use the packet (= multi cast)
+
+If both scheme 1 and 2 are suitable, then scheme 1 typically yields a larger payload, because it
+reserves slots for all nodes, whereas the payload for scheme 2 only contains data from one node.
+Scheme 1 and 2b are useful if the transit nodes also use or modify the packet data. The multiple
+hops are then used to multi cast the data. Scheme 2a is suitable for packet transport from start
+to end node, whereby transit nodes only pass on the packet.
+
+For the beamformer beamlets scheme 1 is most suitable. The start node prepares the packet with
+the initial beamlet sums. The subsequent nodes add their local beamlet sum to the packet
+beamlet sums and then pass on the packet.
+
+For the subband correlator both scheme 1 and scheme 2b are suitable. For scheme 1 the start node
+creates a packet with slots for all nodes and fills in its own slot with its crosslets. Scheme 1
+was used in LOFAR 1.0. The subsequent nodes fill in their slots with their crosslets and also
+use the packets to correlate the remote crosslets with their local crosslets. With scheme 2b
+each node creates a packet with its own crosslets and sends it to N/2 nodes further. The
+intermediate nodes pass on or remove the packets and use the packets to correlate the remote
+crosslets with their local crosslets. The disadvantage of scheme 1 is that it requires a
+dedicated start node that initiates the aggregate packet. With scheme 2b each node acts as start
+node for its own packet. Intermediate nodes use the remote packets for correlation and pass
+them on. The final destination node removes the packet.
+
+For the subband offload both scheme 1 and scheme 2a are suitable. For scheme 1 the start node
+creates a packet with slots for all nodes and fills in its own slot with its subbands. The
+subsequent nodes fill in their slots with their subbands. With scheme 2a each node creates a
+packet with its own subbands and sends it to the output end node. The other nodes only pass on
+the remote packets.
+
+For transient buffer read out scheme 2a is most suitable to gather the read out data from each
+node at the output end node.
 
-Use one packet type per ring link.
-This avoids having to multiplex different packet types onto a single link. Still the Ethernet type can be used to fill
-in the packet type to more easily identify data on different links of the ring.
 
-Use application packets to monitore the link quality:
-This allows monitoring the link during normal operation and avoids the need to define and control a test packet (e.g.
-like ping).
+Ring access directions:
+The ring can be used in both directions. The forward direction is e.g. from PN0 to PN15, the
+backward direction is e.g. from PN15 to PN0, for N = 16 nodes.
+All schemes can be used in two directions for the same type of data transport. In one direction
+the maximum number of hops between start and end node is N-1, while by using both directions the
+maximum number of hops between start and end node is N/2. If the data is used on all intermediate
+nodes, then there is no advantage to use the ring in both directions. If the data is only passed
+along by intermediate nodes, then the link capacity is used about a factor two more efficiently
+by sending data in both directions. Disadvantages of using the ring in both directions for the
+same type of data are that each node needs to decide which direction to use, that the data arrives
+from both directions at the end node, and that it is somewhat more difficult to understand and
+diagnose. 
 
-Wormhole routing or store-and-forward routing:
-With worm hole routing a received packet or a received and modified packet is already transmitted, while the tail of 
-the packet is still being received. The advantage of wormhole routing is that it minimizes the latency along the ring
-and therefore also local buffering to align between local and remote data. The disadvantage of wormhole routing is
-that a CRC error on the received packet needs to be propagated by forcing the CRC of the transmitted packet to be 
-wrong. This implies that all subsequent hops will show this CRC error. For link diagnoses this is confusing, because
-the subsequent links did not cause the CRC error. With store-and-forward routing a packet is first received entirely
-before it is passed on for transmit. This allows to discard a received packet with a CRC error, but does increase the
-latency on the ring. For LOFAR 2.0 choose to use store-and-forward, because it allows discarding packets with CRC
-errors when they occur and because there is sufficient internal block RAM to buffer the local data for the worst case
-ring latency.
+Design decision : Therefore choose to use the ring in only one direction per link.
 
-Ring latency:
-The latency of 1 hop is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at 10Gbps is about 1.2 us
-and a jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for packets >~ 300 octets the ring latency is
-dominated by the store and forward routing at each node. The 10GbE Ethernet MAC uses 64 bit data. At 200 MHz this can
-achieve 64 * 0.2 = 12.8 Gbps. Hence if the processing operates without data valid gaps, then the Ethernet transmit
-will not run empty during a payload. Therefore it is not necessary to use a fill FIFO, which would add to the ring 
-latency. For a packet that travels the entire ring the latency is then about (N-1) * T_sub and the corresponding 
-FIFO depth to align the local data with this remote data is (N-1) * packet size.
+
+Use one link per packet type:
+For scheme 2 use only one link for all source nodes, so do not let different source nodes use
+different links. For N/2 = 8 or N = 16 the number of links would become too large. By using one
+link for all sources, increasing the processing becomes a matter of using and instantiating more
+links.
 
 
-Only accept correct packets:
-Discard all packets that have a CRC error. This also prevents that packets of wrong length enter the internal
-processing. The Ethernet CRC error is 32 bit, so it is very unlikely that packet with errors still has a 
-correct CRC. With wormhole routing it was necessary to limit or extend a packet to a known fixed length, because
-also packets with CRC error are passed on. With store-and-forward routing the CRC provides sufficient protection
-to ensure that only correct packets enter the application.
+Remote and local data alignment:
+In APERTIF the data arrived from >= 2 remote streams. With the LOFAR ring there is always local
+data that arrives first and needs to be aligned with only one remote data stream. The local data
+needs to be buffered until the remote data from the farthest PN has arrived. The latency on the
+ring is about 1 packet per transit hop, due to the store-and-forward. The first hop has negligible
+latency. Hence with H hops the local data buffer size needs to be (H-1) * local data size.
+
 
 Ring data transport schemes:
   - beamlets on ring: l --> r+l --> r+l --> ... --> r+l
@@ -200,13 +357,10 @@ Ring data transport schemes:
     . output filler data if remote got lost, to preserve nominal output rate to CEP
     
   - crosslets on ring:  rrrrrrrr,l --> rrrrrrrr,l --> ... --> rrrrrrrr,l
-    . on each node separately align N/2 pairs of inputs l,r, have one pair per XC cell
-    or
-    . on each node first align all inputs l,N/2*r, and then split into N/2 pairs of l,r to have one pair per XC cell
-    . discard output data if remote got lost, to count number of active blocks per integration sync interval
-      or
-      output filler data if remote got lost, and use zero to not disturb the intergation and count unflagged blocks
-      to know the number of active blocks per integration sync interval
+    . on each node first align all inputs l,N/2*r, and then split into N/2 pairs of l,r to have one
+      pair per XC cell (or on each node separately align N/2 pairs of inputs l,r, have one pair per
+      XC cell). Output filler data if remote got lost, and use zero to not disturb the integration
+      and count unflagged blocks to know the number of active blocks per integration sync interval.
     
   - subbands on ring: l, rl, rrl, rrrl, ..., rrrrrrrrrrrrrrrl
     . on final node align all l,(N-1)*r inputs
@@ -216,56 +370,6 @@ Ring data transport schemes:
     . no align, readout from one node at a time
 
 
-Ring access schemes:
-
-- 1) start node sends packet to end node, intermediate nodes modify the packet.
-- 2a) each node starts sending its packets to an end node, intermediate nodes pass on the packet
-- 2b) each node starts sending its packets to an end node, intermediate nodes pass on the packet and use the packet (= multi cast)
-
-If both scheme 1 and 2 are suitable than scheme 1 typically yields a larger payload, because it reserves slots for all
-nodes, whereas the payload for scheme 2 only contains data from one node. Scheme 1 and 2b are useful if the transit nodes
-also use or modify the packet data. Scheme 2a is suitable for packet transport from start to end node, whereby transit
-nodes only pass on the packet.
-
-For the beam former beamlets scheme 1 is most suitable. The start node prepares the packet with the initial beamlet sums.
-The subsequent nodes add there local beamlet sum to the packet beamlet sums and then pass on the packet.
-
-For the subband correlator both scheme 1 and scheme 2b are suitable. For scheme 1 the start node creates a packet with
-slots for all nodes and fills in its own slot with its crosslets. Scheme 1 was used in LOFAR 1.0. The subsequent nodes fill in
-their slots with their crosslets and also use the packets to correlate the remote crosslets with their local crosslets.
-With scheme 2b each node creates a packet with its own crosslets and sends it to N/2 nodes further. The intermediate node
-pass on the packets and use the packets to correlate the remote crosslets with their local crosslets.
-
-For the subband offload both scheme 1 and scheme 2a are suitable. For scheme 1 the start node creates a packet with slots for all
-nodes and fills in its own slot with its subbands. The subsequent nodes fill in their slots with their subbands. With scheme 2a
-each node creates a packet with its own subbands and sends it to the output end node. The other nodes only pass on the remote packets.
-
-For transient buffer read out scheme 2a is most suitable to gather the read out data from each node at the output end node.
-
-
-Ring access directions:
-All schemes can be used in two directions for the same type of data transport. In one direction the maximum number
-of hops between start and end node is N-1, while by using both directions the maximum number of hops between start
-and end node is N/2. If the data is used on all intermediate nodes, then there is no advantage to use the ring in
-both directions. If the data is only passed along by intermediate nodes, then the link capacity is used
-about a factor two more efficiently by sending data in both directions. Disadvantages of using the ring in both
-directions for the same type of data are that each node needs to decide which direction to use, that the data
-arrives from both directions at the end node, and that it is somewhat more difficult to understand and diagnose. 
-Design decision : Therefore choose to use the ring in only one direction per link.
-
-Use one link per packet type:
-For scheme 2 use only one link for all source nodes, so do not let different source nodes use different links. For
-N/2 = 8 or N = 16 the number of links would become too large. By using one link, increasing the processing becomes
-a matter of using and instantiating more links.
-
-
-Remote and local data alignment:
-In APERTIF the data arrived from >= 2 remote streams. With the LOFAR ring there is always local data that arrives
-first and needs to be aligned with only one remote data stream. The local data needs to be buffered until the remote
-data from the farthest PN has arrived. The latency on the ring is about 1 packet per transit hop, due to the store
-and forward. The first hop has negligible latency. Hence with H hops the local data buffer size needs to be (H-1) *
-local data size. When the remote data arrive the local data is popped from the buffer. It the remote data has not
-arrived in time, then the local data is popped from the buffer when the next local data is pushed into the buffer.
 
 
 *******************************************************************************
@@ -273,139 +377,211 @@ arrived in time, then the local data is popped from the buffer when the next loc
 *******************************************************************************
 
 What is the beamlet packet size?
-The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At the end PN the
-final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate beamlet sum has W_beamlet =
-18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976 beamlets per packet. The payload size is
-N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte = 2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet
-size is 40 + 4392 = 4432 octets. With f_sub = 195312.5 Hz and R_os = 1.28 the data rate is 4432 * 195312.5 * 1.28
-* 8 = 8.864 Gbps, which fits on a 10GbE link.
+The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At
+the end PN the final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate
+beamlet sum has W_beamlet_sum = 18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976
+beamlets per packet. The payload size is N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte =
+2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet size is 60 + 4392 = 4452 octets. With
+f_sub = 195312.5 Hz the data rate is 4452 * 195312.5 * 8 = 6.95625 Gbps < L_lane = 7.8125 Gbps,
+so it fits on a 10GbE lane.
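The beamlet packet arithmetic above can be reproduced with a short Python sketch. The constants
mirror the symbols and values quoted in the text (60 octets packet overhead, L_lane = 7.8125 Gbps);
the function name beamlet_packet is only illustrative and not part of any existing code.

```python
# Sketch: beamlet packet size and lane load, using the numbers quoted in the text.
N_POL = 2              # polarizations
S_SUB_BF = 488         # beamlets per polarization
N_COMPLEX = 2          # real and imaginary part
W_BEAMLET_SUM = 18     # bits per real or imaginary part of an intermediate beamlet sum
W_BYTE = 8             # bits per octet
P_OVERHEAD = 60        # packet overhead in octets, as used in the text
F_SUB = 195312.5       # subband (block) rate in Hz
L_LANE = 7.8125e9      # usable lane capacity in bit/s, as used in the text


def beamlet_packet():
    """Return (payload octets, packet octets, data rate in bit/s)."""
    payload = N_POL * S_SUB_BF * N_COMPLEX * W_BEAMLET_SUM // W_BYTE
    packet = P_OVERHEAD + payload
    rate = packet * W_BYTE * F_SUB
    return payload, packet, rate


payload, packet, rate = beamlet_packet()
print(payload, packet, rate / 1e9, rate < L_LANE)
```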
 
 Packet decoding and encoding:
-The start node encodes the packet and the end node decodes the packet. The intermediate nodes could operate on
-the encoded packet, however the payload beamlets are packed into bytes and are not word aligned. Therefore the
-intermediate nodes also need to decode the packet to be able to update the payload data, and then encode the 
-packet. The decode and encode function is available in any node, because all nodes run the same firmware image.
-Therefore the decoding and encoding at intermediate nodes can reuse the encoding function of the start node and
-the decode function of the end node, so no extra logic is needed.
+The start node encodes the packet and the end node decodes the packet. The intermediate nodes could
+operate on the encoded packet, however the payload beamlets are packed into bytes and are not word
+aligned. Therefore the intermediate nodes also need to decode the packet to be able to update the
+payload data, and then encode the packet again. The decode and encode function is available in any
+node, because all nodes run the same firmware image. Therefore the decoding and encoding at
+intermediate nodes can reuse the encoding function of the start node and the decode function of the
+end node, so no extra logic is needed.
 
 Ring adder payload processing:
-The station beam is a dual polarization beam and each beam has S_sub_bf = 488 beamlets, so in total there are 
-976 complex beamlets per subband period of N_fft = 1024 cycles @ 200 MHz. For an oversampled filterbank with
-R_os = 4/3 there are N_fft / R_os = 768 cycles @ 200 * R_os MHz. Hence to be compatible with an oversampled
-filter bank the beamformer cannot process all 976 beamlets in series, instead it has to apply ceil(R_os) = 2
-streams in parallel that each process 488 beamlets. Therefore to support the oversampled beamlets the paylaod
-needs to be encoded from and decoded to two streams of beamlets:
-
-  0 : 0 2 4 ............. 974
-  1 : 1 3 5 ............. 975
-  
-The 10Gbps data on the ring interface is available as 32 bit data at 312.5 MHz (32 * 312.5M = 10G). 
+The full band station beam has S_sub_bf = 488 beamlets per polarization, so in total there are 
+N_pol * S_sub_bf = 2 * 488 = 976 complex beamlets per subband period of N_fft = 1024 cycles @
+200 MHz. For an oversampled filterbank with R_os > 1 the processing rate is increased to
+200 * R_os MHz, so there are still N_fft = 1024 cycles available to process 976 beamlets. The ring
+adder adds the local beamlet sum to the received beamlet sum and passes on the result. The beamlet
+sum is received as a packet with 64 bit packed data at 156.25 MHz (64 * 156.25M = 10G). The 976
+beamlets fit in 976 * 18b * 2 / 64b = 549 64b words. The packet is processed at 200 * R_os MHz.
+
+  . from 10GbE -->
+  . Rx packet 64b @ 156MHz --> Rx FIFO to dp_clk domain -->
+  . Rx packet 64b @ 200MHz --> DP/ETH decode to discard or extract payload of 549 words -->
+  . Rx payload 64b @ 200MHz --> repack 549 words to 976 beamlets -->
+  . Align remote and local beamlets -->
+  . Sum remote and local beamlets --> repack 976 beamlets to 549 words -->
+  . Tx payload 64b @ 200MHz --> DP/ETH encode to add header and tail -->
+  . Tx packet 64b @ 200MHz --> Tx FIFO to tx_clk domain -->
+  . Tx packet 64b @ 156MHz --> 
+  . to 10GbE
+
+
+? Does align belong to ring or to beamlet ring adder?
+--> to beamlet ring adder:
+    - to avoid having an align input and output on the ring interface.
+    - implies that align monitor also belongs to beamlet ring adder
+? Does sum belong to ring or to beamlet ring adder or to local beamformer?
+--> to beamlet ring adder:
+    - it deserves a dedicated block, because it is part of the BF (so not of the ring) and it only
+      adds (so does not have BF weights like the local BF).
+
 
 Local beamlet sums FIFO size:
-The local subband data needs to be buffered until the beamlet sum arrives. The last node experiences the largest
-latency, because then the beamlet sum has travelled N-1 hops, each adding about 5888 * 8 / 10G = 4.71 us. The
-total latency for the LBA ring is (16 - 1) * 4.71 us = 70.6 us or about 14 T_sub. With some extra margin assume
-that the last N-1 or N local beamlets need to be buffered. Per PN this yields a FIFO size of N_pol * S_sub_bf *
-N * N_complex * W_subband = 2 * 488 * 16 * 2 * 18 = 562176 bit, which takes about 32 M20k block RAMs.
+The local subband data needs to be buffered until the beamlet sum arrives. The size of the buffer
+is determined by the last node, because then the beamlet sum has travelled N-1 hops. For each hop the
+packet is delayed by:
+ - packet encoding
+ - packet transport over the ring
+After each hop the packet is delayed by:
+ - store-and-forward to be able to check the CRC
+ - packet decoding
+ - packet processing
+The store-and-forward causes a latency of one block period (T_sub) per hop and is the dominant
+factor in the latency. During this latency N-1 local blocks need to be buffered. Assume that the
+processing and transport delays are shorter than one block period, so buffering one extra local
+block is sufficient to compensate for them. Per PN this yields a FIFO size of N_pol * S_sub_bf * N *
+N_complex * W_subband = 2 * 488 * 16 * 2 * 18 = 562176 bit, which takes about 32 M20k block RAMs.
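The FIFO size can be checked with a short Python sketch. The bit count follows the formula in the
text. The block RAM estimate is an assumption that is not stated in the text: it supposes that a
complex 36 bit word is stored in an M20K configured as 512 words deep, and that the FIFO depth is
rounded up to a power of two. The function names are hypothetical.

```python
# Sketch: local beamlet sums FIFO size for N nodes, using the numbers quoted in the text.
N_POL = 2
S_SUB_BF = 488
N_COMPLEX = 2
W_SUM = 18             # bits per real or imaginary part
M20K_DEPTH = 512       # assumption: M20K used in a 512 deep, up to 40 bit wide configuration


def fifo_bits(n_nodes):
    """Number of bits needed to buffer n_nodes local blocks."""
    return N_POL * S_SUB_BF * n_nodes * N_COMPLEX * W_SUM


def fifo_m20k_blocks(n_nodes):
    """Block RAM estimate, assuming one complex word per address and a power of two depth."""
    n_words = N_POL * S_SUB_BF * n_nodes
    depth = 1 << (n_words - 1).bit_length()   # round up to a power of two
    return depth // M20K_DEPTH


print(fifo_bits(16), fifo_m20k_blocks(16))
```

Under these assumptions the estimate for N = 16 agrees with the "about 32" block RAMs mentioned
above.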
 
 Ring modes:
 - off
 - local
 - remote
 - combine
-With dp_bsn_align all these modes are supported by enabling/disabling the corresponding inputs.
-
-FIFO flush:
-A FIFO can be flushed by resetting it, but this requires careful control to ensure that the reset is noticed
-in both clock domains, and that the reset is applied in between input packets to avoid that only a tail
-of a packet gets into a FIFO. Therefore in LOFAR 1.0 and APERTIF a FIFO is flushed by reading the packets 
-from it until it is empty. This scheme also allows flushing per packet. The disadvantage of reading the 
-packets and the discard them, is that it takes as long as reading at full speed.
-
-Lost remote packet detection:
-Local FIFO full:
-The local FIFO needs to buffer the local data to be able to align with the remote data. The latency between
-nodes depends on the number of hops. With N = 16 nodes and store and forward packet transport the maximum
-latency will be < N * T_sub. To compensate for this latency the local FIFO needs to be able to store at most
-about N local packets. If the FIFO runs full, then this is an indicator that remote packets got lost and
-then the local FIFO needs to be flushed until it is empty.
-Rx timeout:
-The average packet rate on the ring is f_sub, so within T_sub there should arrive a new packet. If no packet
-arrives within T_sub, then the local FIFO can flush one packet. In this way the local FIFO does not need to
-be flushed until empty and less packets will get lost once the remote packets arrive again. Using Rx timeout
-does rely on that packets fit within a T_sub interval and that every T_sub interval contains at least part
-of a packet, so the actual packet rate must be close to the average packet rate.
-
-
-Remote packets:
-The remote packets drive the ring adder and are processed on arrival. The local packet with the same time stamp
-is already pending in the local beamlets FIFO. If a burst of remote packet gets lost, then the node will 
-notice this because its local beamlets keep arriving and will overflow the local beamlets FIFO. The node will
-read and discard packets from the local beamlets FIFO to make sure that the FIFO does not overflow. If only
-one or a few remote packets got lost, then the node will noticethis during the time stamp alignment, but
-only as soon as the next packet has arrived. This next packet will be ahead of the local packet, so the local
-packets need to be flushed. The node will then read and discard packets from the local beamlets FIFO until it 
-can align the remote and local data. During this realignment process the next remote packet may already arrive
-as well. Therefore the remote packet needs to be buffered, or discarded. Assume the FIFO is flushed by reading
-and then discarding packets from it. The local packets and the remote packets arrive at the same rate. If the
-flushing of the packets goes faster then reading them, because flushing can use all clock cycles. The flushing
-can only catch up if the gaps between packets are large enough. Therefore in LOFAR 1.0 the remote packets were
-discarded during the flushing. This does mean that when one packet gets lost, the flushing will also discard 
-the next packet and some more for as long as it takes to empty the local beamlets FIFO. An alternative would
-be to keep on flushing and discarding remote packets, until the local beamlet FIFO is again ahead of the
-remote packets. Typically packets will get lost rarely or in bursts. In both cases it is fine to just flush
-the local beamlet FIFO until it is empty.
 
-   PN0     PN1     PN2     PN3     PN4   
-t                                        
-0: L0      L1      L2      L3      L4         <-- S_sub_bf = 488 beamlets (dual pol complex) per packet
-     R4      R0      R1      R2      R3  
-       R3      R4      R0      R1      R2
+With dp_bsn_align_v2 all these modes are supported by enabling/disabling the corresponding inputs.
+
 
 The beamformer function has the following sub functions:
 - "Beamlet subband select" : Select S_sub_bf = 488 subbands per signal input
-- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for S_pn = 12 signal inputs
+- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for
+                       S_pn = 12 signal inputs
 - "Beamlet ring adder" : 
   if start node:
-    - Encode beamlet sums packet to ring
+    - Encode local beamlet sums packet to ring
   else:
-    - Buffer the local beamlet sums for >= N subband intervals
+    - Buffer the local beamlet sums for ~= N subband intervals
     - Decode remote beamlet sums packet from ring
     - Align remote beamlet sums packet and local beamlet sums packet
     - Add local beamlet sums to remote beamlet sums packet
     if transit node:
       - Encode beamlet sums packet to ring
     else:
-      - "Beamlet data output" : Scale and output beamlet sums
-- "Beamlet statistics (BST)": Calculate BST
+      - "Beamlet data output" : On output node scale and output final beamlet sums
+- "Beamlet statistics (BST)": Calculate BST for beamlet sums, output node has final BST
 
 
 *******************************************************************************
 * Subband Correlator
 *******************************************************************************
 
-Crosslet transport scheme:
-Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops. The remote crosslets
-are correlated with the local crosslets. The remote crosslets arrive in packets from the N/2 preceding nodes.
-First the local crosslets are correlated with themselves and then the local crosslets are kept in a barrel shifter,
-such that they can also be correlated with the remote crosslets that arrive in the packets.
-- count N_int for monitoring
+With transport scheme 1 crosslets from different source nodes are combined into one packet.
+Scheme 2b packs only local crosslets into a packet. Compared to scheme 1, scheme 2b:
+- treats the local and remote crosslets independently
+- has a small payload and thus more packet overhead, but the load still fits on a lane
+- has a small payload that can be enlarged by transporting more local crosslets, to support
+  a subband correlator with N_crosslets > 1.
 
+Design decision:
+  Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops,
+  because it is more flexible to have only local crosslets per packet. 
 
-Square correlator cell:
-There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2 remote crosslet
-packets. The local crosslets have to be correlated with the local crosslets and with each of the remote crosslet
-packets. The correlation with the local crosslets is a square matrix that yields X_sq = S_pn * S_pn = 144 visibilities.
 
 Number of square correlator cells per PN:
-With N = 16 PN for LBA there are N/2 = 8 remote crosslet packets. Hence together with the local crosslet visibilities
-this yields X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN.
+There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2
+remote crosslet packets. The local crosslets have to be correlated with the local crosslets and
+with each of the remote crosslet packets. The correlation with the local crosslets is a square
+matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square correlator
+cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other local-remote
+square correlator cells the efficiency is 100 %. With N = 16 PN for LBA there are N/2 = 8 remote
+crosslet packets. Hence together with the local crosslet visibilities this yields
+X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN.
+
+
+Number of multipliers per crosslet:
+The subband correlator needs to finish within one subband period, so within N_fft = 1024 clock
+cycles. The X_pn = 1296 visibilities per PN can be calculated using one complex multiplier if
+the complex multiplier runs at 1296 / 1024 * 200 M > 253 MHz. For an oversampled filterbank with
+R_os <= 1.28 this requires 324 MHz, which is too much. All X_pn = 1296 visibilities can be calculated
+using two complex multipliers running at > 161 MHz. However another option is to use one multiplier
+per X_sq = 144 visibilities, so one complex multiplier per correlator cell and N/2 + 1 = 9 
+correlator cells in parallel. The FPGA has sufficient multipliers to support this scheme and the
+spare capacity of each correlator cell can be used to support a subband correlator with more 
+than 1 subband per integration interval, so N_crosslets > 1.
+
+Design decision:
+  Use 1 + N/2 parallel correlator cells, for the local-local visibilities and for the
+  local-remote visibilities for each remote source.
+
+
+What is the crosslet packet size?
+With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per
+packet. A crosslet is a W_crosslet = 16 bit complex value, so 12 * 4 = 48 octets payload, so the
+effective packet size is P_packet = 60 + 48 = 108 octets. The relative packet overhead for single
+crosslet payloads is P_overhead / P_packet = 60 / 108 ~= 56 %.
+
+Maximum number of crosslets per lane:
+There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields
+a packet load of P_packet * f_sub * N/2 = (108 * 8b) * 195312.5 * 16 / 2 = 1.35 Gbps. The data
+load of only the payload data is payload size * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 =
+0.6 Gbps. Hence the small packet size causes a large packet overhead, but is still acceptable,
+since it is < L_lane = 7.8125 Gbps so it fits on a single 10G lane of the ring.
+Multiple local crosslets could be transported via separate packets, a lane can then fit about
+7.8125 / 1.35 ~= 5 different crosslets. Packing the local crosslets into a single payload
+reduces the packet overhead. The maximum number of crosslets per packet follows from
+(P_overhead + X * 48) * 8b * f_sub * N/2 < L_lane. For N = 16 this yields X <
+(7.8125 Gbps / (16/2) / 195312.5 / 8b - 60) / 48 = 11.8. With 12 crosslets the payload size is
+12 * 48 = 576 and the effective packet size is P_packet = 60 + 576 = 636 octets. The relative
+packet overhead for multi crosslet payloads is P_overhead / P_packet = 60 / 636 ~= 9.4%. The
+packet load for multi crosslet payloads is (636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps >
+L_lane = 7.8125 Gbps, so this just does not fit on a 10GbE lane, due to the still significant
+packet overhead. Using X = 11 instead of 12 crosslets per packet yields a total crosslet packet
+load per lane of ((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on a lane.
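The lane budget for crosslet packets can be checked numerically with a small Python sketch. The
constants are the values quoted in the text (60 octets overhead, 48 octets per crosslet set,
L_lane = 7.8125 Gbps); the function names are hypothetical.

```python
# Sketch: crosslet packet load on a ring lane as a function of crosslets per packet.
P_OVERHEAD = 60        # packet overhead in octets
CROSSLET_OCTETS = 48   # S_pn = 12 complex 16 bit values per crosslet
F_SUB = 195312.5       # subband (block) rate in Hz
L_LANE = 7.8125e9      # usable lane capacity in bit/s


def packet_load_bps(x, n_nodes=16):
    """Lane load in bit/s when every node sends x crosslets per packet over N/2 hops."""
    packet_octets = P_OVERHEAD + x * CROSSLET_OCTETS
    return packet_octets * 8 * F_SUB * (n_nodes // 2)


def max_crosslets(n_nodes=16):
    """Largest x for which the packet load stays below the lane capacity."""
    x = 0
    while packet_load_bps(x + 1, n_nodes) < L_LANE:
        x += 1
    return x


print(packet_load_bps(1) / 1e9, packet_load_bps(12) / 1e9, max_crosslets())
```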
+
+Design decision: 
+  Pack local crosslets into a single payload if N_crosslets > 1, because then the packet overhead
+  is much reduced to support transporting more crosslets per lane (11 instead of 5).
+
+  
+Maximum number of crosslets per correlator cell:
+An X_pn correlator cell can correlate N_fft / X_sq = 1024 / 144 = 7 different crosslet frequencies.
+With N = 16 for LBA, there need to be N/2 + 1 = 9 of these X_pn correlator cells in parallel. One
+X_pn correlates the local-local crosslets and the other N/2 X_pn correlates the local-remote
+crosslets. These 9 X_pn in parallel can correlate up to 7 crosslets. The link can transport 
+maximum 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO capacity of 1
+10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets. The crosslet data rate
+on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a utilization of 4.95 / 
+7.8125 = 63 %. Another set of 9 X_pn could be used to correlate the remaining 11 - 7 = 4 crosslets
+that can be transported via the ring. However, if more than N_crosslets = 7 crosslets need to be
+correlated in parallel per integration, then it is easier to allocate an extra lane and to
+instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in total.
+
+One X_pn takes one complex multiplier. For N_crosslets = 1 crosslet per integration interval using
+N/2+1 = 9 X_pn uses only 144 / 1024 = 14% of the processing resources. However this is acceptable 
+because:
+- the FPGA has sufficient multipliers
+- it provides a clear design
+- the spare capacity can be used to process more crosslets per integration interval
+
+Design decision:
+   Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or up to 7
+   crosslets in parallel, per integration interval.
+  
+
+
+Send more than one time slot per packet?
+To reduce the relative packet overhead for single crosslet XC it is an option to put multiple
+time slots per payload. Design decision: This is considered too complicated.
+
+
+What if a packet gets lost?
+The local crosslets cannot get lost, but remote packets may get lost. The BSN aligner will replace
+lost remote packets with filler packets that are flagged. The crosslets in the filler packets
+contain zero data, so in the correlator they do not contribute to the visibilities. Each correlator
+cell has to count the number of valid and flagged crosslets per integration interval. The number
+of valid crosslets N_valid can be used to weight the visibility relative to the expected number of
+N_int crosslets. The number of flagged crosslets N_flagged is used for monitoring. For every
+integration interval N_valid + N_flagged = N_int, by design of the BSN aligner.
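The counting of valid and flagged crosslets can be illustrated with a minimal Python sketch for one
local-remote pair. It is only a model of the behaviour described above, not the firmware: the
function name, the choice to conjugate the remote input and the N_int / N_valid scaling are
assumptions made for illustration.

```python
# Sketch: integrate one visibility with flagged filler data, for a single local-remote pair.
def integrate_visibility(local, remote, flagged):
    """Accumulate local * conj(remote) over one integration interval.

    flagged[i] is True when remote[i] is filler data inserted by the aligner. Filler data
    is zero, so it is accumulated like valid data but does not change the sum.
    """
    acc = 0j
    n_valid = 0
    n_flagged = 0
    for l, r, f in zip(local, remote, flagged):
        acc += l * r.conjugate()
        if f:
            n_flagged += 1
        else:
            n_valid += 1
    n_int = n_valid + n_flagged
    # Assumed weighting: scale to the level that N_int valid crosslets would have given.
    scaled = acc * n_int / n_valid if n_valid else 0j
    return acc, scaled, n_valid, n_flagged


local = [1 + 1j, 2 + 0j, 1j]
remote = [1 - 1j, 0j, 1j]          # second remote crosslet is zero filler data
flagged = [False, True, False]
acc, scaled, n_valid, n_flagged = integrate_visibility(local, remote, flagged)
print(acc, scaled, n_valid, n_flagged)
```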
+
 
-Crosslet period:
-The subband correlator needs to finished within one subband period, so T_xc < T_sub. For the critically sampled
-filterbank the subband period is N_fft = 1024 sample periods. The X_pn = 1296 visibililies per PN can be
-caluculated using one complex multiplier if the multiplier runs at 1296 / 1024 * 200 M > 253 MHz. For an oversampled
-filterbank with R_os <= 1.25 this requires 1.25 * 253 = 317 MHz, which may be too much.
 
 Time in diagrams:
 - equal time for all PN in same row and in same relative column
@@ -441,81 +617,11 @@ N_int-1:                                      <-- Dump and restart XST:
                                                   - not calculated because conj()
                                                                                                     
 
-What is the crosslet packet size?
-With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per packet. A crosslet is
-a W_crosslet = 16 bit complex value, so 12 * 4 = 48 octets payload, so the effective packet size is 40 + 48 = 88 octets.
-The relative packet overhead for single crosslet payloads is 40 / 88 = 45 %.
-
-There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields a packet load of
-packet size * f_sub * N/2 = (88 * 8b) * 195312.5 * 16 / 2 = 1.1 Gbps. The data load of only the payload data is
-payload size * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 = 0.6 Gbps. Hence the small packet size causes a large
-packet overhead, but is still acceptable, since it fits on a single 10G link of the ring.
-
-Calculate one or multiple crosslets:
-With small payloads the 10G link could fit about  10/1.1 ~= 8 different crosslets. With larger payloads the 10G link
-could fit about 10 / 0.6 = 16 crosslets. The advantage of using small payloads is that adding more crosslets can be done
-by instantiating the same single crosslets XC multiple times. However the small packets do have to travel sequentially 
-via the same 10G link, so there needs to be a multiplexer after that the local ETH frames have been made. The advantage of
-using larger payloads is that they can be made by putting the extra crosslets in the same payload. With 16 crosslets
-the payload size is 16 * 48 = 768 and the effective packet size is 40 + 768 = 808 octets. The relative packet overhead for 
-multi crosslet payloads is 40 / 808 ~= 5 %. The packet load for multi crosslet payloads is (808 * 8b) * 195312.5 * 16 / 2 =
-10.1 Gbps, so this will just not fit on a 10GbE link, but 15 crosslets would.
-                                                                                   
-At 200 MHz for the critically sampled subbands, a X_pn correlator cell can correlate N_fft / X_sq = 1024 / 144 = 7
-different crosslets frequencies. With N = 16 for LBA there need to be N/2 + 1 = 9 of these X_pn correlator cells in
-parallel. One X_pn correlates the local-local crosslets and the other N/2 X_pn correlates the local-remote crosslets.
-These 9 X_pn in parallel can correlate up to 7 crosslets. The link can transport 15 crosslets, so 18 X_pn in parallel
-could correlate 14 different crosslets to make better use of the link capacity.
-
-One X_pn takes one complex multiplier. For one crosslet using N/2+1 = 9 X_pn is a waste of resources, but still 
-acceptable and providing a clear design.
-
-
-Send more than one time slot per packet?
-To reduce the relative packet overhead for single crosslet XC it is an option to put multiple time slots per payload.
-This is considered to complicating.
-
-   PN0     PN1     PN2     PN3     PN1   
-t                                        
-0: L00     L11     L22     L33     L44        <-- For example two time slots per packet
-      R44     R00     R11     R22     R33 
-         R33     R44     R00     R11     R22
 
-2:
 
-What if a node fails?
-The next N/2 nodes will then miss packets. The order of the packets is not affected, because on each node it will
-be the last one or more packets that are missed. There will be no correlations for the missed packets, but the
-correlation should continue if the next time slot the node starts again. A packet count per packet source at each
-node will reveal missed packets and thus also the number of integrations that happened in the final visibilities.
-If no packets are missed then the packet count is 195312.5 per integration interval on every PN for every packet
-source PN.
-
-   PN0     PN1     PN2     PN3     PN1   
-t                                        
-0: L0      .       L2      L3      L4         <-- PN1 fails, so next N/2 nodes will miss packets
-     R4    .         .       R2      R3  
-       R3  .           .       .       R2
-   00      .       22      33      44
-     04      .       .       32      43
-       03      .       .       .       42
 
        
-What if a packet gets lost?
-If a packet gets lots then it can cause a gap in the packet order, so the next packet must not be mistaken as
-the lost packet. Therefore the packets must have a time slot number and a source number, such that the XST in
-each node will use it for the correct visibilities.
 
-   PN0     PN1     PN2     PN3     PN1   
-t                                        
-0: L0      L1      L2      L3      L4
-     R4      .       R1      R2      R3       <-- L0 from PN0 gets lost at PN1
-       R3      R4      .       R1      R2
-
-Packet order is guarantueed?
-At the start of every time slot the local L# packet is send first. After that each node passes on the packets that 
-it receives. Therefore the packets arrive in order with packet from closest node first and from furtherst node
-last. If a packet gets lost then there will be a gap, but the order is still preserved.
 
 What if T_sq > T_hop latency on ring?
 What if T_sub > N/2 * T_hop latency on ring?
diff --git a/applications/lofar2/doc/prestudy/station2_to_do_erko.txt b/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
index ee6be7b14d..b7862111df 100755
--- a/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
+++ b/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
@@ -182,15 +182,16 @@ git remote remove <remote name> # remove a remote repo
 *******************************************************************************
 Open issues:
 - Central HDL_IO_FILE_SIM_DIR = build/sim --> Project local sim dir
-- avs_eth_coe.vhd per tool version? Because copying avs_eth_coe_<buildset>_hw.tcl to $HDL_BUILD_DIR copies the
-  last <buildset>, using more than one buildset at a time gices conflicts.
+- avs_eth_coe.vhd per tool version? Because copying avs_eth_coe_<buildset>_hw.tcl to $HDL_BUILD_DIR
+  copies the last <buildset>, using more than one buildset at a time gives conflicts.
 
 
 
 *******************************************************************************
 * To do:
 *******************************************************************************
-- Check that the Expert users (MB, SJW, MN), Maintainers (HM) and Local users are happy with the design decisions
+- Check that the Expert users (MB, SJW, MN), Maintainers (HM) and Local users are happy with the
+  design decisions
 - H6 M&C loads section
 - H3 Functions mapping
 - H3/4 Timing (1s default, PPS, event message)
@@ -225,7 +226,7 @@ Open issues:
 - Update RadioHDL docs
 - Write RadioHDL article
 - Write HDL RL=0 article - desp_hdl_design_article.txt
-
+- XST : SNR = 1 per visibility for 10000 samples, brightest source log 19.5 --> 4.5 dB --> T_int = 1 s is ok.
 
 
 
-- 
GitLab