Finished draft design for SDP ring. Updated on SDP planning.

Commit 3148327c authored Dec 5, 2019 by Eric Kooistra
--- a/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt
@@ -80,140 +80,61 @@ No oversampled filterbank:
 * SDP Workpackage (UniBoard2 HW + FW)
 *******************************************************************************
+Changed tasks:
+- T4.6 : 20 weeks booked explicitly for Required documents
+- T4.1 : 10 weeks, because GIT, RadioHDL finished
+- T4.2 : 10 weeks, because some FW done
+- T4.2 : ? weeks, hardware effort
 Firmware FPGA images:
 - the SDP has one main firmware design unb2c_sdp,
 - the integrated design of SDP is revision unb2c_sdp_station,
 - per task there are revisions of unb2c_sdp that contain subsets of the SDP functionality,
-Deliverables (D): items that are needed for a milestone
+Deliverables (D) = an item, product : items that are needed for a milestone
-Milestones (M) : 'cake moments' when you demonstrate deliverables
+Milestones (M) = a moment in time, achievement : 'cake moments' when you demonstrate or review
+                 deliverables as part of a larger system
 - integration passed
 - review passed
+Planning for LOFAR2.0 Station Workpackage 4 : Station Digital Processing
-Tasks:
+Below is the planning in weeks per task, the work includes:
+- UniBoard2 hardware 
+- Firmware that runs on UniBoard2
-INFRASTRUCTURE UniBoard2:
+weeks task   description
-  weeks  nr task
+10    T4.1   Maintain firmware development environment (GIT, RadioHDL, HDL libraries)
-     20  1) Maintain firmware development environment
+10    T4.2   UniBoard2 test firmware (enable mass production of UniBoard2)
-            - using GIT
+ ?    T4.2   UniBoard2 board hardware
-            - using RadioHDL
+20    T4.3   UniBoard2 board support package (BSP, M&C via Gemini Protocol, use ARGS for doc, C, VHDL)
-            - updating existing VHDL library components
+10    T4.4   Network access via 10GbE (support ARP and ping)
-            D=> Operational firmware development environment
+20    T4.5   Ring access using test data and BSN monitor (support ring)
-            D=> VHDL libraries verified in simulation
+20    T4.6   Required documents (SDP RS, detailed design, ICDs, FW manual)
+15    T4.7   ADC input and timestamp (RCU2 interface, capture timestamped data for offline analysis)
+10    T4.8   Subband filterbank (Fsub, critically sampled, SST)
+20    T4.9   Subband correlator (XC, one subband per 1 s integration)
+10    T4.10  Beamformer (BF, BST, beamlet output to CEP)
-         2) UniBoard2 board and test firmware
+10 + 10 + ? + 20 + 10 + 20 + 20 + 15 + 10 + 20 + 10 = 145 + ? weeks
-            - unb2c board HW
-            D=> unb2c board detailed design document
-            D=> unb2c board schematic
-            D=> unb2c board layout
-            M=> unb2c board detailed design document review (unb2b modifications)
+Milestone : SDP ready for CDR:
-            M=> unb2c board schematic review
+All major technical UniBoard2 hardware and SDP firmware risks are mitigated:
-            M=> unb2c board layout review (production ready)
-            M=> unb2c board lab validation using JTAG, unb2c_test designs OK
-            M=> unb2c board production validation using JTAG, unb2c_minimal_gmi OK
-      5     - unb2c FPGA pinning design
+- by design
-     10     - unb2c FPGA interface test designs
+- SDP hardware and interfaces validated with at least two UniBoard2 using JTAG, firmware for BSP,
-            D=> unb2c_test design revisions (1GbE, 10GbE, DDR4, flash, ADC)
+  ring and ADC
-            D=> unb2c_test_adc (read ADC samples from multiple inputs)
+- Station TD validated using BF beamlet output to CEP
+The remaining tasks concern completing the applications that the firmware needs to perform.
-     20  3) UniBoard2 board support package (BSP)
-            - M&C by SCU via Gemini protocol
+weeks task   description
-            - M&C interface definition and generation using ARGS (doc, C, HDL)
+25    T4.11  Transient buffer (TB, ADC data, subband data)
-            D=> Gemini board for SCU M&C tests
+20    T4.12  Transient detection (TDET)
-            D=> unb2c_minimal_gmi (1GbE, flash)
+20    T4.13  Subband offload (SO) for AARTFAAC2.0
-            M=> unb2c_minimal_gmi validated using M&C by SCU (read design name)
+20    T4.14  Station integration tests (using unb2c_sdp_station)
-INFRASTRUCTURE SDP:
-     10  4) Network access via 10GbE
-            - Ethernet MAC, UDP/IPv4, ARP, ping
-            D=> 10GbE HDL component including support for UDP/IPv4, ARP, ping
-            D=> unb2c_10GbE
-            M=> unb2c_10GbE validated using data capture on PC and ping
-     20  5) Ring access using test data and BSN monitor
-            D=> unb2c_ring_combiner for BF
-            D=> unb2c_ring_multicast for XC
-            D=> unb2c_ring_endcast for SO, TB
-            M=> unb2c_ring revisions verified in simulation
-            M=> unb2c_ring revisions validated on hardware using M&C on SCU
-APPLICATION SDP documents:
-         6) Required documents
-            D=> Detailed design document of SDP firmware
-            D=> L1 ICD-11109 SDP-CEP: beamlet data protocol
-            D=> L1 ICD-11109 SDP-CEP: transient data protocol
-            D=> L2 ICD-11211 SC-SDP: FW register map and register definitions
-            D=> L2 ICD-11211 SC-SDP: UniBoard2 hardware M&C
-            D=> L2 ICD-11207 RCU2S-SDP: ADC interface
-            D=> L2 ICD-11209 STF-SDP: Time and frequency interface
-            D=> L2 ICD-11218 SDP-STCA: Subrack interface
-            M=> SDP detailed design and interface documents ready for DDR
-            M=> SDP detailed design and interface documents updated for CDR
-            D=> SDP firmware verification and maintenance document
-            M=> SDP all documents finished
-APPLICATION single node:
-  weeks  nr task
-     15  7) ADC input and timestamp (RCU2 interface)
-            ==> unb2c_sdp_adc_capture, read ADC or WG samples from databuffer via M&C
-            ==> unb2c_sdp_station (ADC)
-M=> SDP ready for CDR
-    All major technical UniBoard2 hardware and SDP firmware risks are mitigated (by design and
-    based on validation with at least two UniBoard2 using JTAG, unb2c_minimal_gmi, unb2c_ring,
-    and unb2c_sdp_adc_capture).
-     10  8) Subband filterbank (Fsub)
-            ==> unb2c_sdp_filterbank to read SST via M&C
-            ==> unb2c_sdp_station (ADC + SST)
-APPLICATION multi node:
-  weeks  nr task
-     20  9) Subband correlator (XC)
-            ==> unb2c_sdp_correlator_one_node, read XST via M&C and create ACM for one node
-            ==> unb2c_sdp_correlator_multi_node, read XST via M&C and use ring to create complete ACM
-            ==> unb2c_sdp_station (ADC + SST + XST)
-APPLICATION multi node / network output:
-  weeks  nr task
-     10 10) Beamformer (BF)
-            ==> unb2c_sdp_beamformer_bst_one_node, read BST via M&C
-            ==> unb2c_sdp_beamformer_output_one_input, output to CEP for one input from one node
-            ==> unb2c_sdp_beamformer_output_one_node, output to CEP and sum one node
-            ==> unb2c_sdp_beamformer_output_multi_node, output to CEP and use ring to sum nodes
-            ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output)
-            ==> detailed design doc
-     25 11) Transient buffer (TB)
-            ==> unb2c_sdp_transient_buffer revisions (ADC + SST + TB readout, M&C access DDR4)
-            ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout)
-            ==> detailed design doc
-     20 12) Transient detection (TD)
-            ==> unb2c_sdp_transient_buffer revisions (ADC + TD event)
-            ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout + TD event)
-            ==> detailed design doc
-     20 13) Subband offload (SO) for AARTFAAC2.0
-            ==> unb2c_sdp_subband_offload revisions (ADC + SST + SO, one node, all nodes via ring)
-            ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout + TD event + SO)
-            ==> detailed design doc
-INTEGRATION:
-  weeks  nr task
-     20 14) Station integration tests (using unb2c_sdp_station)
-            - Laboratory tests
-            - Technical commissioning Dwingeloo Test Station ("Huisje West")
-            - Technical commissioning Prototype Test Station
-            - Technical commissioning Pre-production Test Station
+25 + 20 + 20 + 20 = 85 weeks
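Note: the two week totals in this planning can be cross-checked with a few lines of Python. This is only a sketch; the dictionary keys and the helper name `booked_weeks` are illustrative, and the open '?' booking for the UniBoard2 hardware is represented as `None`.

```python
# Weeks per task as listed in the planning; None marks the open '?' hardware booking.
cdr_tasks = {
    "T4.1": 10, "T4.2 firmware": 10, "T4.2 hardware": None, "T4.3": 20,
    "T4.4": 10, "T4.5": 20, "T4.6": 20, "T4.7": 15, "T4.8": 10,
    "T4.9": 20, "T4.10": 10,
}
post_cdr_tasks = {"T4.11": 25, "T4.12": 20, "T4.13": 20, "T4.14": 20}


def booked_weeks(tasks):
    """Return (sum of known weeks, number of tasks that are still unbooked)."""
    known = [weeks for weeks in tasks.values() if weeks is not None]
    return sum(known), len(tasks) - len(known)


cdr_total, cdr_open = booked_weeks(cdr_tasks)
post_total, post_open = booked_weeks(post_cdr_tasks)
print(f"up to CDR: {cdr_total} weeks + {cdr_open} open booking(s)")
print(f"after CDR: {post_total} weeks + {post_open} open booking(s)")
```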
--- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
@@ -324,6 +324,15 @@ Scheme 1 and 2b are useful if the transit nodes also use or modify the packet da
 hops are then used to multi cast the data. Scheme 2a is suitable for packet transport from start
 to end node, whereby transit nodes only pass on the packet.
+With scheme 1 each node has a two input BSN aligner that needs to buffer a large packet. With
+scheme 2 the end node has an N input BSN aligner that needs to align N small packets. Even
+though scheme 2 only uses the BSN aligner at the end node, it is there at all nodes, because all
+nodes run the same firmware image. Therefore the resource usage of the BSN aligner will
+typically not differ much for scheme 1 or 2.
+If one hop fails in scheme 1 then there is no offload. If one hop fails in scheme 2a then there
+is still offload from subsequent hops.
 For the beamformer beamlets scheme 1 is most suitable. The start node prepares the packet with
 the initial beamlet sums. The subsequent nodes add their local beamlet sum to the packet
 beamlet sums and then pass on the packet.
@@ -817,150 +826,204 @@ Assumptions for Station.SDP:
 - The number of subbands per lane is set the same for R_os = 1 and R_os = 1.28. This
  implies that the utilization of the lanes for R_os = 1 is about a factor 1.28 less.
-Select S = 96 from S_lba = 192 signal inputs
-AARTFAAC uses the dual pol antennas, so the signal inputs (SI) have to be selected per pair of X
-and Y polarization. The N = 48 antennas can be selected from the N_lba = 96 antennas in different
-either at the offload node or at each PN:
- Transport all SI to the offload node and select there
- Select SI per PN and only transport the selected SI to the offload node
-The first schemeThe selection can be programmable or fixed. 
-First collect all S_lba = 192 signal inputs at the offload node, and then make an arbitrary 
-  selection or a fixed selection. The disadvantage is that this doubles the load on the ring.
- Select at each PN and transport only First collect all S_lba = 192 signal inputs at the offload node, and then make an arbitrary 
-  selection or a fixed selection.
-Use ring transport scheme 1 or scheme2a:
- With scheme 1 the selection of S out of S_lba can be made per PN, as the payload is passed along
-  and each node can insert none, all or a subset of its S_pn at the allocated subband index in the
-  payload. With scheme 2a the selection of S will be done at the offload node, so all PN then send
-  all their S_pn inputs via the ring. This doubles the load on the ring.
- With scheme 2a each node only has to pass on the remote packets, but at the offload node it 
-  needs an N input BSN aligner, an N input to one output subband selection to get the offload
-  payload. With scheme 1 the first node initiates the offload payload and then each node has to
-  insert the local subbands at the correct index. This requires only a two input BSN aligner.
- If one hop fails in scheme 1 then there is no offload. If one hop fails in scheme 2a then there
-  is still offload from subsequent hops.
+Required subband output load for AARTFAAC2.0:
 Current AARTFAAC1 can offload S_sub_so = 32 subbands for S = 96 signal inputs (SI) in W_subband_so
 = 16 bit mode. On the RSP - Uniboard interface there are 9 subbands per lane, so S_sub_so = 36 in
 total, but on the UniBoard - UDP interface to the GPU correlator only 8 subbands, so 32 in total
 are output. The AARTFAAC1 output load is S_sub_so * S * f_sub * N_complex * W_subband_so =
-32 * 96 * 195312.5 * 2 * 16 = 19.2 Gbps. Due to a bug in probably the RSP firmware, W_subband_so
+32 * 96 * 195312.5 * 2 * 16 = 19.2 Gbps. Due to a bug (probably in the RSP firmware), W_subband_so
 = 8 bit mode cannot be supported, but for LOFAR2.0 it can. Hence for the same output load as
 AARTFAAC1, AARTFAAC2.0 can offload S_sub_so = 64 subbands, which corresponds to a bandwidth of
-64 * 195312.5 Hz = 12.5 MHz.
+64 * 195312.5 Hz = 12.5 MHz. Assume the AARTFAAC offload will use 4 10GbE links, so S_sub_lane =
+16 subbands per lane. The payload size is P_payload = S_sub_lane * S * N_complex * W_subband_so /
+W_byte = 16 * 96 * 2 * 8/8 = 3072 octets.
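Note: the AARTFAAC load and payload figures can be reproduced with the sketch below. The function names are illustrative and not part of the design; the constants follow the symbols used in the text (f_sub, N_complex, W_byte).

```python
F_SUB = 195312.5   # subband sample rate in Hz
N_COMPLEX = 2      # real and imaginary part
W_BYTE = 8         # bits per octet


def output_load_bps(n_subbands, n_signal_inputs, w_subband_bits):
    """Subband offload load in bit/s, excluding packet overhead."""
    return n_subbands * n_signal_inputs * F_SUB * N_COMPLEX * w_subband_bits


def payload_octets(n_subbands_per_lane, n_signal_inputs, w_subband_bits):
    """Payload size in octets for one time sample of all subbands on a lane."""
    bits = n_subbands_per_lane * n_signal_inputs * N_COMPLEX * w_subband_bits
    return bits // W_BYTE


load_aartfaac1 = output_load_bps(32, 96, 16)   # 32 subbands in 16 bit mode
load_aartfaac2 = output_load_bps(64, 96, 8)    # 64 subbands in 8 bit mode
bandwidth_hz = 64 * F_SUB
p_payload = payload_octets(64 // 4, 96, 8)     # 4 lanes, 16 subbands per lane

print(load_aartfaac1 / 1e9, load_aartfaac2 / 1e9, bandwidth_hz / 1e6, p_payload)
```

Both modes give the same 19.2 Gbps load, which is why halving the sample width doubles the number of subbands that can be offloaded.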
-For LOFAR 2.0 the number of LBA doubles to S_lba = 192, but AARTFAAC2.0 assumes that still S = 96
-will offload subbands. Assume that the S = 96 signal inputs can be selected from the S_lba = 192
+Maximum subband output load per 40Gbps QSFP:
-available signal inputs at the Station output. Therefore internally in the Station SDP the
-subbands from all S_lba are passed on via ring to an the output node in SDP. For the LBA the ring
-in SDP connects N = S_lbs / S_pn = 192 / 12 = 16 nodes, so N-1 hops. Assume all subbands are send
-in one direction along the ring. The subband data load on the last hop is then
-(N-1)/N * 2 * 19.2G = 15/16 * 2 * 19.2G = 36.0 Gbps, excluding packet overhead. Given a lane 
-load capacity of L_lane = 7.8125 Gbps, this implies that the subband offload requires at least 
-ceil(36.0 / 7.8125) = ceil(4.6) = 5 lanes.
 The load from one W_subband_so = 8 bit subband is L_sub_so = S * f_sub * N_complex *
-W_subband_so = 192 * 195312.5 * 2 * 8 = 0.6 Gbps. Per 10GbE lane this then yields maximum of
+W_subband_so = 96 * 195312.5 * 2 * 8 = 0.3 Gbps. Per 10GbE lane this then yields a maximum of
-L_lane / L_sub_so = 7.8125G / 0.6G = 33.3 subbands for
+L_lane / L_sub_so = 7.8125G / 0.3G = 26.0 subbands. The 10GbE requires some spare capacity, so
-R_os = 1 and 10G / 0.75G = 13.3 subbands for R_os = 1.25. The 10GbE requires some spare
+therefore assume S_sub_so = 24 subbands / 10GbE link will just fit for R_os <= 1.28, provided
-capacity, so therefore assume S_sub_so = 12 subbands / 10GbE link will just fit for R_os <= 1.25, provided that
+that the packet overhead is < (26-24)/24 ~= 8%. Hence with one 4 * 10GbE QSFP port at the
-the packet overhead is < (13.3-12)/12 ~= 10 %. Hence with one 4 * 10GbE QSFP port at the final PN it is possible
+final PN it is possible to offload 4 * 24 = 96 subbands or 18.75 MHz bandwidth with S = 96 signal
-to offload 4 * 12 = 48 subbands or 9.375 MHz bandwidth with S_lba = 192 signal paths and W_subband_so = 8 bit. 
+paths and W_subband_so = 8 bit. The ring can be used to transport the subbands to some single
-The ring can be used to transport the subbands to some single destination PN that then performs the output via
+destination PN that then performs the output via the 4 x 10GbE ports or 40GbE port on the QSFP.
-the 4 x 10GbE ports or 40GbE port on the QSFP. The destination PN could also do subband reordering to group
+The destination PN could also do subband reordering to group subbands per S = 96 inputs.
-subbands per S_lba = 192 inputs.
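Note: a minimal sketch of the per-lane capacity estimate for S = 96 signal inputs in 8 bit mode. The variable names are illustrative; the lane capacity of 7.8125 Gbps and the choice of 24 subbands per lane are taken from the text.

```python
F_SUB = 195312.5      # subband sample rate in Hz
N_COMPLEX = 2
L_LANE = 7.8125e9     # usable lane capacity in bit/s, as quoted in the text
S = 96                # selected signal inputs
W_SUBBAND_SO = 8      # bits per real or imaginary part

# Load of one subband for all S signal inputs, excluding packet overhead.
load_per_subband = S * F_SUB * N_COMPLEX * W_SUBBAND_SO

# Upper bound on the number of subbands per lane without any overhead.
max_subbands = L_LANE / load_per_subband

# Chosen number of subbands per lane and the packet overhead it leaves room for.
chosen_subbands = 24
overhead_margin = (int(max_subbands) - chosen_subbands) / chosen_subbands

# One QSFP port carries 4 lanes.
subbands_per_qsfp = 4 * chosen_subbands
bandwidth_hz = subbands_per_qsfp * F_SUB

print(load_per_subband / 1e9, max_subbands, overhead_margin)
print(subbands_per_qsfp, bandwidth_hz / 1e6)
```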
+Transport via ring:
-The subbands are gathered at the output node via the ring. Using the ring avoids the need to use a 10GbE switch.
-Such a switch would need > 16 + 16 ports to support LBA + international HBA and some output ports. If the data
+The subbands are gathered at the output node via the ring. Using the ring avoids the need to use
-is gathered, then it can as well be reordered to combine all S signal inputs in a single payload. The subbands
+a 10GbE switch. Such a switch would need > 16 + 16 ports to support LBA + international HBA and
-can be send to the output node via the ring using either scheme 1 or scheme 2a:
+some output ports. If the data is gathered, then it can as well be reordered to combine all S 
+signal inputs in a single payload. The subbands can be sent to the output node via the ring using
-If the subband data is transported in one packet using scheme 1, then the payload can contain all 192 signal
+either scheme 1 or scheme 2a:
-inputs. The payload size for S_sub_so = 12 subbands then becomes S * S_sub_so * N_complex * W_subband_so / W_byte
-= 192 * 12 * 2 * 8/8 = 4608 octets. The packet overhead is then (40 + 4608) / 4608 = 1.009, so 0.9 % overhead.
-Each node then inserts its local subbands at the appropriate offset in the payload. The packet size is 40 + 4608
+Select N = 48 from N_lba = 96 antennas:
-= 4648 octets and the data rate is f_sub, so the load is on all links in the ring is:
-packet size * W_byte * f_sub * R_os = 4648 * 8 * 195312.5 * 1.25 ~= 9.08 Gbps.
+AARTFAAC uses the dual pol antennas, so the signal inputs (SI) have to be selected per pair of X
+and Y polarization. AARTFAAC2.0 selects N = 48 antennas per Station. It is unlikely that 
-If the subband data is send in separate packet for each PN using scheme 2a, then the payload size for
+AARTFAAC2.0 will select more than N = 48, because AARTFAAC2.0 would rather correlate more Stations
-S_sub_so = 12 subband / 10GbE link, S_pn = 12 signal inputs per PN and W_subband_so = 8 bit becomes
+than more inputs per Station. The selection can be:
-S_pn * S_sub_so * N_complex * W_subband_so / W_byte = 12 * 12 * 2 * 8/8 = 288 octets. The packet overhead is
-then (40 + 288) / 288 = 1.14, so 14 % overhead. The packet size is 40 + 288 = 328 octets and at the end node
+- one fixed selection
-there are N-1 packets on the ring. This yields a aggregate 'packet size' of (16-1)*328 = 4920 octets. The
+- subset selection
-load on the last link in the ring is: (N-1) * packet size * W_byte * f_sub * R_os =
+- completely arbitrary selection
-(16-1) * 328 * 8 * 195312.5 * 1.25 ~= 9.61 Gbps. Note that the packet overhead of 14 % is larger than the
-maximum estimated allowable packet overhead of 10 % to transport 12 subbands. The reason that 12 subbands
+The disadvantage of the fixed selection is that it rules out half of the LBA. The disadvantage
-still fit is that the last node in the ring does not have to transport its local subbands via the ring, 
+of the arbitrary selection is the book keeping within Station and TM. The N = 48 antennas can
-so the use capacity on the last link is a factor (N-1)/N less. At the output node the data from all N node
+be selected from the N_lba = 96 antennas at different stages within SDP:
-is combined into one payload of size N * 288 = 4608 octets, so the output load is ~= 9.18 Gbps (identical
-to scheme 1, because the output rate does not depend on which ring scheme was used).
+- Transport all SI to the end node and select there. The advantage is that an arbitrary selection
+  can be done at the end node.
-Both scheme 1 and scheme 2a can send offload 12 subbands per 10GbE link. The difference is that scheme 1 has
+  . With transport scheme 2a the selection of N from N_lba will be done at the end node, so all
-a load of ~9.08 Gbps on all hops, whereas for scheme 2a the load increases wit every hop and has a maximum
+    PN then send all their S_pn inputs via the ring. The disadvantage is that it doubles the
-of ~9.61 Gbps on the last hop. With scheme 1 each node has to put its
+    load on the ring and requires a selection at the offload node.
-local subbands at the right location in the packet. In this way the end node only needs to output the 
-payload, because the data is already in the subband offload payload format. With scheme 2a all nodes just
+- Transport only the selected SI per PN to the end node. For arbitrary selection this complicates
-send their local data and pass on the transit data. At the end node a dispatcher and BSN aligner are needed
+  the control per node, because a different number of SI may be selected per node. A compromise can
-to align the packets from all N = 16 nodes. After that the end node needs to reorder the data from these
+  be to only support selecting all S_pn inputs for a node or none, or to only support selecting the
-N = 16 input payloads into the subband offload payload format. This functionality in the end node is similar
+  same S_pn/2 inputs for each node.
-to the rsp_terminal function on UniBoard1 for AARTFAAC. Scheme is specific to the ring, scheme 2a would also
+  . With transport scheme 1 the payload is passed along and each node can insert none, all or a
-work if the subband data is send to the end node via a switch (or via URI like with RSP).
+    subset of its S_pn at the allocated subband index in the payload. The payload size is fixed,
+    because it contains S signal inputs.
-With scheme 2a the ring could be used in both directions, but this does not improve the capacity of the
+  . With transport scheme 2a each node only sends the selected inputs. For arbitrary selection
-ring. With scheme 2a in one direction  the packets travel 1+2+3+...+(16-1) = 120 hops. With scheme 2a in
+    this yields payload sizes that depend on the selection, which is awkward. 
-both directions the packets travel
-1+2+3+4+5+6+7+8 = 36 hops left and 1+2+3+4+5+6+7 = 28 hops right, so total 64 hops. For the transport load
+The advantage of scheme 1 is that the output payload is already formed by the selection at each
-on the ring as a whole scheme 2 is a factor 102/64 = 1.875 more efficient. However at the end node both
+node. With scheme 2a a multiplexer is needed to combine the payloads from all nodes into the
-schemes still have transfer the same load of 15 packets. Therefore at the end node the load for both
+output packet. If in scheme 1 a packet gets lost, then all subbands from the remote nodes that
-schemes is the same. Hence with 15+15 or (8+7)+(7+8) packets arriving at the end node, this node has no
+were already passed are lost. If in scheme 2a a packet gets lost, then only the subbands from the
-spare capacity left to receive more subband packets via these two links.
+node that sent that packet are lost.
-Using the ring in both directions does reduce the latency and therefor the input buffering at the end node
-by a factor 1.875. Furthermore less hops also proportionally reduces the packet error rate. It is easier
+Design decision:
-to use the ring in only one direction, because all nodes then send in the same direction, independent of
+- Assume the SO only transports the selected subbands and uses scheme 2a. The selection is made
-their location in the ring.
+  by letting each node either send all S_pn = 12 inputs or none. Hence only N/2 = 8 nodes send
+  subbands, the other N/2 nodes remain quiet. The selected nodes are identified via the
-At the output node the packet payload is put in an UDP/IP packet and with an SDO application header. The UDP/IP
+  channel field, e.g. if node 0, 3, 4, 5, 6, 7, 8, 11 are selected for output, then they get
-header has 8+20 = 28 octets. The SDO header in LOFAR 1.0 has 22 octets. The output packet size is 40 + 28 + 22
+  channel index 0:7 via M&C and the other nodes do not send subbands. The channel index
-+ 4608 = 4698 octets and the output data rate is packet size * W_byte * f_sub * R_os = 4698 * 8 * 195312.5 *
+  determines the order of the subbands in the output packet. In this way:
-1.25 ~= 9.18 Gbps. The output load is independent of the ring scheme. The ring has 12 full duplex 10GbE links.
+  . the ring only transports selected subbands,
-Suppose 8 of these can be allocated to subband offload, then the ring can suppport subband offload for maximum
+  . subbands can be selected from different antennas, but only in groups of S_pn = 12 per PN,
-2*8 * 12 = 192 subbands (= 37.5 MHz). This then requires 2*8 / 4 = 4 QSFP ports, on different nodes.
+  . the antenna allocation per PN must suit the required SI selections.
+For LOFAR 2.0 the number of LBA doubles to S_lba = 192, but AARTFAAC2.0 assumes that still S = 96
+will offload subbands. Assume that only the selected S = 96 signal inputs are transported via the
+ring using scheme 2a. For the LBA the ring in SDP connects N = S_lba / S_pn = 192 / 12 = 16 nodes,
+so N-1 hops. Assume the subbands are sent in one direction along the ring. The subband data load
+on the last hop is then (N-1)/N * 19.2G = 15/16 * 19.2G = 18.0 Gbps, excluding packet overhead.
+Given a lane load capacity of L_lane = 7.8125 Gbps, this implies that the subband offload requires
+at least ceil(18.0 / 7.8125) = ceil(2.3) = 3 lanes. Assume that 3 lanes will be used to transport
+the S_sub_so = 64 subbands for AARTFAAC2.0. Choose S_sub_lane >= 22 subbands per lane. At the end
+node the selection from 3*22 = 66 to 64 subbands will be made.
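Note: the last-hop load and the resulting number of ring lanes can be checked as follows. This is a sketch under the assumptions stated in the text (N = 16 nodes, 19.2 Gbps total subband load, 7.8125 Gbps per lane); the variable names are illustrative.

```python
import math

N_NODES = 16          # N = S_lba / S_pn = 192 / 12
TOTAL_LOAD = 19.2e9   # bit/s for S = 96 signal inputs, 64 subbands, 8 bit
L_LANE = 7.8125e9     # usable lane capacity in bit/s
S_SUB_SO = 64         # subbands to offload for AARTFAAC2.0

# The end node does not send its own subbands over the ring,
# so the last hop carries (N-1)/N of the total load.
last_hop_load = (N_NODES - 1) / N_NODES * TOTAL_LOAD

lanes_needed = math.ceil(last_hop_load / L_LANE)
subbands_per_lane = math.ceil(S_SUB_SO / lanes_needed)
spare_subbands = lanes_needed * subbands_per_lane - S_SUB_SO

print(last_hop_load / 1e9, lanes_needed, subbands_per_lane, spare_subbands)
```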
+If the subband data is transported in one packet using scheme 1, then the payload can contain all
+96 signal inputs. The payload size for S_sub_lane = 22 subbands then becomes P_payload =
+S * S_sub_lane * N_complex * W_subband_so / W_byte = 96 * 22 * 2 * 8/8 = 4224 octets. The packet
+overhead is P_overhead = 60 octets, so (60 + 4224) / 4224 = 1.014, so 1.4% overhead. Each node then
+inserts its local subbands at the appropriate offset in the payload. The packet size is P_packet =
+60 + 4224 = 4284 octets and the packet rate is f_sub, so the load on all links in the ring is
+P_packet * W_byte * f_sub * R_os = 4284 * 8 * 195312.5 * 1.28 = 8.568 Gbps. Hence transporting
+S_sub_lane = 22 subbands for S = 96 SI fits on a 10GbE lane.
+If the subband data is sent in separate packets for each PN using scheme 2a, then the payload size
+for S_sub_lane = 24 subbands per lane, S_pn = 12 signal inputs per PN and W_subband_so = 8 bit
+becomes S_pn * S_sub_lane * N_complex * W_subband_so / W_byte = 12 * 24 * 2 * 8/8 = 576 octets.
+The packet size is P_packet = 60 + 576 = 636 octets. The packet overhead is P_packet / P_payload
+= 636 / 576 = 1.10, so 10% overhead. At the end node there are N/2-1 packets on the ring if the
+end node is selected for offload and N/2 packets if the end node does not contribute SI for offload.
+Assume worst case N/2 packets on the last link. The load on the last link in the ring is then
+N/2 * P_packet * W_byte * f_sub * R_os = 16/2 * 636 * 8 * 195312.5 * 1.28 ~= 10.176 Gbps, which
+does not fit on a 10GbE lane. Choosing instead S_sub_lane = 22 subbands yields P_payload =
+12 * 22 * 2 * 8/8 = 528, P_packet = 60 + 528 = 588 and an aggregate load of
+16/2 * 588 * 8 * 195312.5 * 1.28 = 9.408 Gbps, which fits on a 10GbE lane.
+Both scheme 1 and scheme 2a can transport S_sub_lane = 22 subbands per 10GbE lane. The difference
+is that scheme 1 has a load of 8.568 Gbps on all hops, whereas for scheme 2a the load increases
+with every hop and has a maximum of 9.408 Gbps on the last hop. With scheme 1 each node has to
+put its local subbands at the right location in the packet. In this way the end node only needs
+to output the payload, because the data is already in the subband offload payload format. With 
+scheme 2a all nodes just send their local data and pass on the transit data. At the end node a 
+demultiplexer and BSN aligner are needed to align the packets from all N/2 = 8 nodes. After that
+the end node needs to reorder the data from these N/2 = 8 input payloads into the subband offload
+payload format. This functionality in the end node is similar to the rsp_terminal function on
+UniBoard1 for AARTFAAC. Scheme 1 is specific to the ring, scheme 2a would also work if the
+subband data is sent to the end node via a switch (or via URI like with RSP).
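Note: the link loads of scheme 1 and scheme 2a can be compared with the sketch below. It assumes the ring packet overhead of 60 octets, S = 96, S_pn = 12, N/2 = 8 sending nodes and R_os = 1.28 from the text; the function names are illustrative. With these inputs 60 + 4224 gives a scheme 1 packet of 4284 octets.

```python
F_SUB = 195312.5    # subband sample rate in Hz
R_OS = 1.28         # oversampling factor
N_COMPLEX = 2
W_SUBBAND_SO = 8    # bits per real or imaginary part
W_BYTE = 8          # bits per octet
P_OVERHEAD = 60     # ring packet overhead in octets, as assumed in the text


def payload_octets(n_signal_inputs, n_subbands_per_lane):
    """Payload size for one time sample of the subbands on one lane."""
    bits = n_signal_inputs * n_subbands_per_lane * N_COMPLEX * W_SUBBAND_SO
    return bits // W_BYTE


def link_load_bps(n_packets, packet_octets):
    """Load on one ring link carrying n_packets per subband period."""
    return n_packets * packet_octets * W_BYTE * F_SUB * R_OS


# Scheme 1: one packet with all S = 96 signal inputs, same load on every hop.
p_packet_1 = P_OVERHEAD + payload_octets(96, 22)
load_scheme_1 = link_load_bps(1, p_packet_1)

# Scheme 2a: one packet per sending node with S_pn = 12 signal inputs,
# worst case N/2 = 8 packets on the last hop.
p_packet_2a_24 = P_OVERHEAD + payload_octets(12, 24)
p_packet_2a_22 = P_OVERHEAD + payload_octets(12, 22)
load_scheme_2a_24 = link_load_bps(8, p_packet_2a_24)
load_scheme_2a_22 = link_load_bps(8, p_packet_2a_22)

for name, load in [("scheme 1, 22 subbands", load_scheme_1),
                   ("scheme 2a, 24 subbands", load_scheme_2a_24),
                   ("scheme 2a, 22 subbands", load_scheme_2a_22)]:
    print(f"{name}: {load / 1e9:.3f} Gbps, fits 10GbE: {load < 10e9}")
```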
+With scheme 2a the ring could be used in both directions, but this does not improve the capacity
+of the ring. With scheme 2a in one direction the packets travel 1+2+3+...+(16-1) = 120 hops.
+With scheme 2a in both directions the packets travel 1+2+3+4+5+6+7+8 = 36 hops left and
+1+2+3+4+5+6+7 = 28 hops right, so total 64 hops. For the transport load on the ring as a whole,
+using both directions is a factor 120/64 = 1.875 more efficient. However at the end node
+the ring still has to transfer the same load of N/2 = 8 packets. Therefore at the end node the
+total load from both directions is the same. Hence with 8 packets arriving from one direction or
+2 * 4 packets arriving from two directions, the end node has no spare capacity left to receive
+more subband packets via these two links. Using the ring in both directions does reduce the
+latency and therefore the input buffering at the end node by a factor 1.875. Furthermore fewer
+hops also proportionally reduces the packet error rate. It is easier to use the ring in only
+one direction, because all nodes then send in the same direction, independent of their location
+in the ring. Design decision: use the ring only in one direction.
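Note: the hop counts for using the ring in one or in both directions follow from summing the distance of every sending node to the end node. The sketch below assumes, like the text, that all N-1 = 15 remote nodes send; the function names are illustrative.

```python
def hops_one_direction(n_nodes):
    """Total hops when all remote nodes send the same way around the ring."""
    return sum(range(1, n_nodes))


def hops_both_directions(n_nodes):
    """Total hops when each remote node sends via the shortest way round."""
    n_left = n_nodes // 2            # nodes that reach the end node via the left
    n_right = n_nodes - 1 - n_left   # remaining nodes send via the right
    return sum(range(1, n_left + 1)) + sum(range(1, n_right + 1))


one_way = hops_one_direction(16)
two_way = hops_both_directions(16)
print(one_way, two_way, one_way / two_way)
```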
+At the output node the packet payload is put in a UDP/IP packet with an SDO application
+header. The Ethernet overhead is 40 octets, the UDP/IP header has 8+20 = 28 octets and the SDO
+header in LOFAR 1.0 has 22 octets. Hence the packet overhead is P_overhead = 40 + 28 + 22 = 90
+octets. With 4 offload lanes and 16 subbands per lane, the packet size is P_packet = 90 + 3072
+= 3162 octets and the offload payload data rate is P_payload * W_byte * f_sub * R_os = 3072 * 8 *
+195312.5 * 1.28 ~= 6.144 Gbps (~= 6.324 Gbps including packet overhead). The output load is
+independent of the ring scheme. The ring has 12 full duplex 10GbE links.
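Note: a sketch of the output packet size and the per-lane offload rate, with and without packet overhead. The header sizes are the ones quoted in the text; the variable names are illustrative.

```python
F_SUB = 195312.5   # subband sample rate in Hz
R_OS = 1.28        # oversampling factor
W_BYTE = 8         # bits per octet

ETH_OVERHEAD = 40      # octets
UDP_IP_HEADER = 8 + 20  # octets
SDO_HEADER = 22         # octets, LOFAR 1.0 value

p_overhead = ETH_OVERHEAD + UDP_IP_HEADER + SDO_HEADER
p_payload = 16 * 96 * 2 * 8 // W_BYTE   # 16 subbands per lane, S = 96, 8 bit complex
p_packet = p_overhead + p_payload

payload_rate = p_payload * W_BYTE * F_SUB * R_OS   # excluding packet overhead
packet_rate = p_packet * W_BYTE * F_SUB * R_OS     # including packet overhead

print(p_overhead, p_payload, p_packet)
print(f"{payload_rate / 1e9:.3f} Gbps payload, {packet_rate / 1e9:.3f} Gbps on the wire")
```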
 Design decision:
 - Gather subbands at output node (instead of having a dedicated offload port at each node)
 - Gather the subbands via the ring (to avoid the need for a 10GbE switch with about 40 ports)
- Reorder the subbands to have all subbands from signal inputs in one payload (to ease input stage of user application)
+- Reorder the subbands to have all subbands from signal inputs in one payload (to ease input
- Use scheme 2a and in both directions (to reduce the number of hops and latency)
+  stage of user application)
+- Select subbands in groups of S_pn SI per node from N/2 = 8 nodes to have S = 96 signal
+  inputs. The SI from the other nodes are then not used and not transported.
+- Use ring transport scheme 2a and in one direction (simpler control than using both
+  directions)
+- On the ring 3 lanes are sufficient to transport 22 subbands per lane. At the output 3 lanes
+  are sufficient, but use 4 lanes and output 16 subbands per lane, to reduce the load per
+  lane.
+==> S_sub_so = 64 subbands, with W_subband_so = 8 bit, for N = 48 antennas can be offloaded
+    using 4 lanes on one QSFP port, with ~ 6.144 Gbps per lane in case of R_os = 1.28.
 *******************************************************************************
 * Transient buffer readout
 *******************************************************************************
-The transient buffer stores the data in frames of 2 kByte. A frame contains data from one signal input. The
+The transient buffer stores the data in frames of 2 kByte. A frame contains data from one signal
-memory is divided into pages and each page can contain one frame. The transient buffer readout is controlled
+input. The memory is divided into pages and each page can contain one frame. The transient buffer
-per signal input and defined by a start time and a number of pages. The start time translates into a start
+readout is controlled per signal input and defined by a start time and a number of pages. The
-page. The SCU issues the read commands per signal input. The SDP firmware then reads and outputs the 
+start time translates into a start page. The SCU issues the read commands per signal input. The
-requested frames to CEP. When the transfer has finished, then the SDP firmware sends an event message to the
+SDP firmware then reads and outputs the requested frames to CEP. The SDP firmware keeps a count 
-SCU, and then the SCU issues a read command for the next signal input, until all signal inputs have been
+of the number of frames that have been output and that still need to be output. The SCU can poll
-handled. For the ring the read out per signal input implies that at any time only one node will send data.
+these counts or wait on an event message from the SDP that signals that all frames have been
+sent. When the transfer has finished, the SCU issues a read command for the next signal
-The read frames are encoded into an DP/ETH frame. The first frame that is read is encoded with a sync and the
+input, until all signal inputs have been handled. For the ring the read out per signal input
-subsequent frames that are read can be counted via the BSN field. In this way a BSN monitor at the end node
+implies that at any time only one node will send data.
-can monitor whether all frames for a signal input read out have arrived at the end node. The end node decodes
-the frame and then encodes them into and UDP/IP/ETH frame to CEP. The transit nodes pass on the frames, and
+The read frames are encoded into a DP/ETH frame. The first frame that is read is encoded with a
-also decode the frames to be able to monitor them with a BSN monitor. After each read command has finished the
+sync and the subsequent frames that are read can be counted via the BSN field. In this way a BSN
-SCU can check the BSN monitor at the end node to know whether all frames arrived correctly at the end node.
+monitor at the end node can monitor whether all frames for a signal input read out have arrived
+at the end node. The end node decodes the frames and then encodes them into a UDP/IP/ETH frame
-For 1 Gbps data rate to CEP and packets of about 2 kByte the packet rate is 1e9/ 2000 / 8 = 62500 packets/s
+to CEP. The transit nodes pass on the frames, and also decode the frames to be able to monitor
-or about one packet every 16 us, so about every 3 T_sub. It is allowed to let multiple nodes output TB data,
+them with a BSN monitor. After each read command has finished the SCU can check the BSN monitor
-but the total number of packets/s has to still fit the output link.
+at the end node to know whether all frames arrived correctly at the end node.
+For 1 Gbps data rate to CEP and packets of about 2 kByte the packet rate is 1e9/ 2000 / 8 = 62500
+packets/s or about one packet every 16 us, so about every 3 T_sub. It is allowed to let multiple
+nodes output TB data, but the total number of packets/s has to still fit the output link. The
+readout node provides a programmable inter packet delay to throttle the output rate. The end node
+immediately outputs the packets when they arrive.
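Note: the transient buffer packet rate and a possible inter packet delay setting can be estimated as below. This is a sketch only: the helper `inter_packet_delay_us` and the idea of dividing the link budget equally over the sending nodes are assumptions for illustration, not part of the design.

```python
F_SUB = 195312.5   # subband sample rate in Hz


def packet_rate(link_bps, packet_octets):
    """Packets per second that fill a link of link_bps with packets of packet_octets."""
    return link_bps / (packet_octets * 8)


def inter_packet_delay_us(link_bps, packet_octets, n_nodes=1):
    """Hypothetical delay per node so that n_nodes together stay within link_bps."""
    return n_nodes * 1e6 / packet_rate(link_bps, packet_octets)


rate = packet_rate(1e9, 2000)            # 1 Gbps to CEP, about 2 kByte packets
period_us = 1e6 / rate
t_sub_us = 1e6 / F_SUB
period_in_t_sub = period_us / t_sub_us

print(rate, period_us, period_in_t_sub)
print(inter_packet_delay_us(1e9, 2000, n_nodes=4))
```

With four nodes reading out at the same time, each node would have to space its packets by about 64 us to keep the sum within 1 Gbps.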

--- a/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
+++ b/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
@@ -227,6 +227,9 @@ Open issues:
 - Write RadioHDL article
 - Write HDL RL=0 article - desp_hdl_design_article.txt
 - XST : SNR = 1 per visibility for 10000 samples, brightest source log 19.5 --> 4.5 dB --> T_int = 1 s is ok.
+- BSP registers:
+  . duration of operations : counts time since last power cycle (passive heartbeat)
+  . cause of reboot (power cycle, overtemperature, ...)