Commit 3148327c authored by Eric Kooistra

Finished draft design for SDP ring. Updated on SDP planning.

parent cef140cc
@@ -80,140 +80,61 @@ No oversampled filterbank:
* SDP Workpackage (UniBoard2 HW + FW)
*******************************************************************************
Changed tasks:
- T4.6 : 20 weeks booked explicitly for Required documents
- T4.1 : 10 weeks, because GIT, RadioHDL finished
- T4.2 : 10 weeks, because some FW done
- T4.2 : ? weeks, hardware effort
Firmware FPGA images:
- the SDP has one main firmware design unb2c_sdp,
- the integrated design of SDP is revision unb2c_sdp_station,
- per task there are revisions of unb2c_sdp that contain subsets of the SDP functionality,
Deliverables (D): items that are needed for a milestone Deliverables (D) = an item, product : items that are needed for a milestone
Milestones (M) : 'cake moments' when you demonstrate deliverables Milestones (M) = a moment in time, achievement : 'cake moments' when you demonstrate or review
deliverables as part of a larger system
- integration passed - integration passed
- review passed - review passed
Planning for LOFAR2.0 Station Workpackage 4 : Station Digital Processing
Tasks: Below is the planning in weeks per task, the work includes:
- UniBoard2 hardware
- Firmware that runs on UniBoard2
INFRASTRUCTURE UniBoard2: weeks task description
weeks nr task 10 T4.1 Maintain firmware development environment (GIT, RadioHDL, HDL libraries)
20 1) Maintain firmware development environment 10 T4.2 UniBoard2 test firmware (enable mass production of UniBoard2)
- using GIT ? T4.2 UniBoard2 board hardware
- using RadioHDL 20 T4.3 UniBoard2 board support package (BSP, M&C via Gemini Protocol, use ARGS for doc, C, VHDL)
- updating existing VHDL library components 10 T4.4 Network access via 10GbE (support ARP and ping)
D=> Operational firmware development environment 20 T4.5 Ring access using test data and BSN monitor (support ring)
D=> VHDL libraries verified in simulation 20 T4.6 Required documents (SDP RS, detailed design, ICDs, FW manual)
15 T4.7 ADC input and timestamp (RCU2 interface, capture timestamped data for offline analysis)
10 T4.8 Subband filterbank (Fsub, critically sampled, SST)
20 T4.9 Subband correlator (XC, one subband per 1 s integration)
10 T4.10 Beamformer (BF, BST, beamlet output to CEP)
2) UniBoard2 board and test firmware 10 + 10 + ? + 20 + 10 + 20 + 20 + 15 + 10 + 20 + 10 = 145 + ? weeks
- unb2c board HW
D=> unb2c board detailed design document
D=> unb2c board schematic
D=> unb2c board layout
M=> unb2c board detailed design document review (unb2b modifications) Milestone : SDP ready for CDR:
M=> unb2c board schematic review All major technical UniBoard2 hardware and SDP firmware risks are mitigated:
M=> unb2c board layout review (production ready)
M=> unb2c board lab validation using JTAG, unb2c_test designs OK
M=> unb2c board production validation using JTAG, unb2c_minimal_gmi OK
5 - unb2c FPGA pinning design - by design
10 - unb2c FPGA interface test designs - SDP hardware and interfaces validated with at least two UniBoard2 using JTAG, firmware for BSP,
D=> unb2c_test design revisions (1GbE, 10GbE, DDR4, flash, ADC) ring and ADC
D=> unb2c_test_adc (read ADC samples from multiple inputs) - Station TD validated using BF beamlet output to CEP
The remaining tasks concern completing the applications that the firmware needs to perform.
20 3) UniBoard2 board support package (BSP)
- M&C by SCU via Gemini protocol weeks task description
- M&C interface definition and generation using ARGS (doc, C, HDL) 25 T4.11 Transient buffer (TB, ADC data, subband data)
D=> Gemini board for SCU M&C tests 20 T4.12 Transient detection (TDET)
D=> unb2c_minimal_gmi (1GbE, flash) 20 T4.13 Subband offload (SO) for AARTFAAC2.0
M=> unb2c_minimal_gmi validated using M&C by SCU (read design name) 20 T4.14 Station integration tests (using unb2c_sdp_station)
INFRASTRUCTURE SDP:
10 4) Network access via 10GbE
- Ethernet MAC, UDP/IPv4, ARP, ping
D=> 10GbE HDL component including support for UDP/IPv4, ARP, ping
D=> unb2c_10GbE
M=> unb2c_10GbE validated using data capture on PC and ping
20 5) Ring access using test data and BSN monitor
D=> unb2c_ring_combiner for BF
D=> unb2c_ring_multicast for XC
D=> unb2c_ring_endcast for SO, TB
M=> unb2c_ring revisions verified in simulation
M=> unb2c_ring revisions validated on hardware using M&C on SCU
APPLICATION SDP documents:
6) Required documents
D=> Detailed design document of SDP firmware
D=> L1 ICD-11109 SDP-CEP: beamlet data protocol
D=> L1 ICD-11109 SDP-CEP: transient data protocol
D=> L2 ICD-11211 SC-SDP: FW register map and register definitions
D=> L2 ICD-11211 SC-SDP: UniBoard2 hardware M&C
D=> L2 ICD-11207 RCU2S-SDP: ADC interface
D=> L2 ICD-11209 STF-SDP: Time and frequency interface
D=> L2 ICD-11218 SDP-STCA: Subrack interface
M=> SDP detailed design and interface documents ready for DDR
M=> SDP detailed design and interface documents updated for CDR
D=> SDP firmware verification and maintenance document
M=> SDP all documents finished
APPLICATION single node:
weeks nr task
15 7) ADC input and timestamp (RCU2 interface)
==> unb2c_sdp_adc_capture, read ADC or WG samples from databuffer via M&C
==> unb2c_sdp_station (ADC)
M=> SDP ready for CDR
All major technical UniBoard2 hardware and SDP firmware risks are mitigated (by design and
based on validation with at least two UniBoard2 using JTAG, unb2c_minimal_gmi, unb2c_ring,
and unb2c_sdp_adc_capture).
10 8) Subband filterbank (Fsub)
==> unb2c_sdp_filterbank to read SST via M&C
==> unb2c_sdp_station (ADC + SST)
APPLICATION multi node:
weeks nr task
20 9) Subband correlator (XC)
==> unb2c_sdp_correlator_one_node, read XST via M&C and create ACM for one node
==> unb2c_sdp_correlator_multi_node, read XST via M&C and use ring to create complete ACM
==> unb2c_sdp_station (ADC + SST + XST)
APPLICATION multi node / network output:
weeks nr task
10 10) Beamformer (BF)
==> unb2c_sdp_beamformer_bst_one_node, read BST via M&C
==> unb2c_sdp_beamformer_output_one_input, output to CEP for one input from one node
==> unb2c_sdp_beamformer_output_one_node, output to CEP and sum one node
==> unb2c_sdp_beamformer_output_multi_node, output to CEP and use ring to sum nodes
==> unb2c_sdp_station (ADC + SST + XST + BST + BF output)
==> detailed design doc
25 11) Transient buffer (TB)
==> unb2c_sdp_transient_buffer revisions (ADC + SST + TB readout, M&C access DDR4)
==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout)
==> detailed design doc
20 12) Transient detection (TD)
==> unb2c_sdp_transient_buffer revisions (ADC + TD event)
==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout + TD event)
==> detailed design doc
20 13) Subband offload (SO) for AARTFAAC2.0
==> unb2c_sdp_subband_offload revisions (ADC + SST + SO, one node, all nodes via ring)
==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout + TD event + SO)
==> detailed design doc
INTEGRATION:
weeks nr task
20 14) Station integration tests (using unb2c_sdp_station)
- Laboratory tests
- Technical commissioning Dwingeloo Test Station ("Huisje West")
- Technical commissioning Prototype Test Station
- Technical commissioning Pre-production Test Station
25 + 20 + 20 + 20 = 85 weeks
@@ -324,6 +324,15 @@ Scheme 1 and 2b are useful if the transit nodes also use or modify the packet da
hops are then used to multi cast the data. Scheme 2a is suitable for packet transport from start
to end node, whereby transit nodes only pass on the packet.
With scheme 1 each node has a two input BSN aligner that needs to buffer a large packet. With
scheme 2 the end node has an N input BSN aligner that needs to align N small packets. Even
though scheme 2 only uses the BSN aligner at the end node, it is there at all nodes, because all
nodes run the same firmware image. Therefore the resource usage of the BSN aligner will
typically not differ much for scheme 1 or 2.
If one hop fails in scheme 1 then there is no offload. If one hop fails in scheme 2a then there
is still offload from subsequent hops.
For the beamformer beamlets scheme 1 is most suitable. The start node prepares the packet with
the initial beamlet sums. The subsequent nodes add their local beamlet sum to the packet
beamlet sums and then pass on the packet.
@@ -817,150 +826,204 @@ Assumptions for Station.SDP:
- The number of subbands per lane is set the same for R_os = 1 and R_os = 1.28. This
implies that the utilization of the lanes for R_os = 1 is about a factor 1.28 less.
Select S = 96 from S_lba = 192 signal inputs
AARTFAAC uses the dual pol antennas, so the signal inputs (SI) have to be selected per pair of X
and Y polarization. The N = 48 antennas can be selected from the N_lba = 96 antennas in different
either at the offload node or at each PN:
- Transport all SI to the offload node and select there
- Select SI per PN and only transport the selected SI to the offload node
The first schemeThe selection can be programmable or fixed.
First collect all S_lba = 192 signal inputs at the offload node, and then make an arbitrary
selection or a fixed selection. The disadvantage is that this doubles the load on the ring.
- Select at each PN and transport only First collect all S_lba = 192 signal inputs at the offload node, and then make an arbitrary
selection or a fixed selection.
Use ring transport scheme 1 or scheme2a:
- With scheme 1 the selection of S out of S_lba can be made per PN, as the payload is passed along
and each node can insert none, all or a subset of its S_pn at the allocated subband index in the
payload. With scheme 2a the selection of S will be done at the offload node, so all PN then send
all their S_pn inputs via the ring. This doubles the load on the ring.
- With scheme 2a each node only has to pass on the remote packets, but at the offload node it
needs an N input BSN aligner, an N input to one output subband selection to get the offload
payload. With scheme 1 the first node initiates the offload payload and then each node has to
insert the local subbands at the correct index. This requires only a two input BSN aligner.
- If one hop fails in scheme 1 then there is no offload. If one hop fails in scheme 2a then there
is still offload from subsequent hops.
Required subband output load for AARTFAAC2.0:
Current AARTFAAC1 can offload S_sub_so = 32 subbands for S = 96 signal inputs (SI) in W_subband_so Current AARTFAAC1 can offload S_sub_so = 32 subbands for S = 96 signal inputs (SI) in W_subband_so
= 16 bit mode. On the RSP - Uniboard interface there are 9 subbands per lane, so S_sub_so = 36 in = 16 bit mode. On the RSP - Uniboard interface there are 9 subbands per lane, so S_sub_so = 36 in
total, but on the UniBoard - UDP interface to the GPU correlator only 8 subbands, so 32 in total total, but on the UniBoard - UDP interface to the GPU correlator only 8 subbands, so 32 in total
are output. The AARTFAAC1 output load is S_sub_so * S * f_sub * N_complex * W_subband_so = are output. The AARTFAAC1 output load is S_sub_so * S * f_sub * N_complex * W_subband_so =
32 * 96 * 195312.5 * 2 * 16 = 19.2 Gbps. Due to a bug in probably the RSP firmware, W_subband_so 32 * 96 * 195312.5 * 2 * 16 = 19.2 Gbps. Due to a bug (probably in the RSP firmware), W_subband_so
= 8 bit mode cannot be supported, but for LOFAR2.0 it can. Hence for the same output load as = 8 bit mode cannot be supported, but for LOFAR2.0 it can. Hence for the same output load as
AARTFAAC1, AARTFAAC2.0 can offload S_sub_so = 64 subbands, which corresponds to a bandwidth of AARTFAAC1, AARTFAAC2.0 can offload S_sub_so = 64 subbands, which corresponds to a bandwidth of
64 * 195312.5 Hz = 12.5 MHz. 64 * 195312.5 Hz = 12.5 MHz. Assume the AARTFAAC offload will use 4 10GbE links, so S_sub_lane =
16 subbands per lane. The payload size is P_payload = S_sub_lane * S * N_complex * W_subband_so /
W_byte = 16 * 96 * 2 * 8/8 = 3072 octets.
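The output load and payload size above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic in the text: the constants are the values quoted above, and the function names are made up for illustration.

```python
# Constants taken from the text; the function names are illustrative only.
F_SUB = 195312.5   # subband sample rate in Hz
N_COMPLEX = 2      # real and imaginary part
W_BYTE = 8         # bits per octet


def offload_load_bps(s_sub_so, s, w_subband_so):
    """Subband offload load in bit/s, excluding packet overhead."""
    return s_sub_so * s * F_SUB * N_COMPLEX * w_subband_so


def payload_octets(s_sub_lane, s, w_subband_so):
    """Payload size in octets of one offload packet on one lane."""
    return s_sub_lane * s * N_COMPLEX * w_subband_so // W_BYTE


aartfaac1_bps = offload_load_bps(32, 96, 16)  # 16 bit mode
aartfaac2_bps = offload_load_bps(64, 96, 8)   # 8 bit mode, twice the subbands
bandwidth_hz = 64 * F_SUB
s_sub_lane = 64 // 4                          # 4 x 10GbE links, as assumed in the text

print(f"AARTFAAC1 load    : {aartfaac1_bps / 1e9:.1f} Gbps")
print(f"AARTFAAC2.0 load  : {aartfaac2_bps / 1e9:.1f} Gbps")
print(f"Offload bandwidth : {bandwidth_hz / 1e6:.1f} MHz")
print(f"Payload per lane  : {payload_octets(s_sub_lane, 96, 8)} octets")
```

Halving the sample width while doubling the number of subbands leaves the load unchanged, which is why both modes give the same figure.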
For LOFAR 2.0 the number of LBA doubles to S_lba = 192, but AARTFAAC2.0 assumes that still S = 96
will offload subbands. Assume that the S = 96 signal inputs can be selected from the S_lba = 192 Maximum subband output load per 40Gbps QSFP:
available signal inputs at the Station output. Therefore internally in the Station SDP the
subbands from all S_lba are passed on via ring to an the output node in SDP. For the LBA the ring
in SDP connects N = S_lbs / S_pn = 192 / 12 = 16 nodes, so N-1 hops. Assume all subbands are send
in one direction along the ring. The subband data load on the last hop is then
(N-1)/N * 2 * 19.2G = 15/16 * 2 * 19.2G = 36.0 Gbps, excluding packet overhead. Given a lane
load capacity of L_lane = 7.8125 Gbps, this implies that the subband offload requires at least
ceil(36.0 / 7.8125) = ceil(4.6) = 5 lanes.
The load on the from one W_subband_so = 8 bit subband is L_sub_so = S_lba * f_sub * N_complex * The load on the from one W_subband_so = 8 bit subband is L_sub_so = S_lba * f_sub * N_complex *
W_subband_so = 192 * 195312.5 * 2 * 8 = 0.6 Gbps. Per 10GbE lane this then yields maximum of W_subband_so = 96 * 195312.5 * 2 * 8 = 0.3 Gbps. Per 10GbE lane this then yields maximum of
L_lane / L_sub_so = 7.8125G / 0.6G = 33.3 subbands for L_lane / L_sub_so = 7.8125G / 0.3G = 26.0 subbands. The 10GbE requires some spare capacity, so
R_os = 1 and 10G / 0.75G = 13.3 subbands for R_os = 1.25. The 10GbE requires some spare therefore assume S_sub_so = 24 subbands / 10GbE link will just fit for R_os <= 1.28, provided
capacity, so therefore assume S_sub_so = 12 subbands / 10GbE link will just fit for R_os <= 1.25, provided that that the packet overhead is < (26-24)/24 ~= 8%. Hence with one 4 * 10GbE QSFP port at the
the packet overhead is < (13.3-12)/12 ~= 10 %. Hence with one 4 * 10GbE QSFP port at the final PN it is possible final PN it is possible to offload 4 * 24 = 96 subbands or 18.75 MHz bandwidth with S = 96 signal
to offload 4 * 12 = 48 subbands or 9.375 MHz bandwidth with S_lba = 192 signal paths and W_subband_so = 8 bit. paths and W_subband_so = 8 bit. The ring can be used to transport the subbands to some single
The ring can be used to transport the subbands to some single destination PN that then performs the output via destination PN that then performs the output via the 4 x 10GbE ports or 40GbE port on the QSFP.
the 4 x 10GbE ports or 40GbE port on the QSFP. The destination PN could also do subband reordering to group The destination PN could also do subband reordering to group subbands per S = 96 inputs.
subbands per S_lba = 192 inputs.
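As a cross-check of the per-lane capacity for S = 96 signal inputs and W_subband_so = 8 bit, the sketch below recomputes the number of subbands that fit on one lane, the remaining margin for packet overhead, and the total bandwidth for one QSFP port. The variable and function names are illustrative, and L_lane = 7.8125 Gbps is taken from the text as given.

```python
# Values quoted in the text; names are illustrative only.
F_SUB = 195312.5    # subband sample rate in Hz
N_COMPLEX = 2       # real and imaginary part
L_LANE = 7.8125e9   # usable lane load capacity in bit/s, as stated in the text


def max_subbands_per_lane(s, w_subband_so):
    """Number of subbands (fractional) that fit on one lane without overhead."""
    l_sub_so = s * F_SUB * N_COMPLEX * w_subband_so  # load of one subband in bit/s
    return L_LANE / l_sub_so


max_sub = max_subbands_per_lane(96, 8)
s_sub_lane = 24                                      # choice made in the text
overhead_margin = (int(max_sub) - s_sub_lane) / s_sub_lane
total_subbands = 4 * s_sub_lane                      # 4 lanes on one QSFP port
bandwidth_mhz = total_subbands * F_SUB / 1e6

print(f"max subbands per lane : {max_sub:.2f}")
print(f"overhead margin       : {overhead_margin:.1%}")
print(f"subbands per QSFP     : {total_subbands}")
print(f"offload bandwidth     : {bandwidth_mhz:.2f} MHz")
```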
Transport via ring:
The subbands are gathered at the output node via the ring. Using the ring avoids the need to use a 10GbE switch.
Such a switch would need > 16 + 16 ports to support LBA + international HBA and some output ports. If the data The subbands are gathered at the output node via the ring. Using the ring avoids the need to use
is gathered, then it can as well be reordered to combine all S signal inputs in a single payload. The subbands a 10GbE switch. Such a switch would need > 16 + 16 ports to support LBA + international HBA and
can be send to the output node via the ring using either scheme 1 or scheme 2a: some output ports. If the data is gathered, then it can as well be reordered to combine all S
signal inputs in a single payload. The subbands can be send to the output node via the ring using
If the subband data is transported in one packet using scheme 1, then the payload can contain all 192 signal either scheme 1 or scheme 2a:
inputs. The payload size for S_sub_so = 12 subbands then becomes S * S_sub_so * N_complex * W_subband_so / W_byte
= 192 * 12 * 2 * 8/8 = 4608 octets. The packet overhead is then (40 + 4608) / 4608 = 1.009, so 0.9 % overhead.
Each node then inserts its local subbands at the appropriate offset in the payload. The packet size is 40 + 4608 Select N = 48 from N_lba = 96 antennas:
= 4648 octets and the data rate is f_sub, so the load is on all links in the ring is:
packet size * W_byte * f_sub * R_os = 4648 * 8 * 195312.5 * 1.25 ~= 9.08 Gbps. AARTFAAC uses the dual pol antennas, so the signal inputs (SI) have to be selected per pair of X
and Y polarization. AARTFAAC2.0 selects N = 48 antennas per Station. It is unlikely that
If the subband data is send in separate packet for each PN using scheme 2a, then the payload size for AARTFAAC2.0 will select more than N = 48, because AARTFAAC2.0 would rather correlate more Stations
S_sub_so = 12 subband / 10GbE link, S_pn = 12 signal inputs per PN and W_subband_so = 8 bit becomes than more inputs per Station. The selection can be:
S_pn * S_sub_so * N_complex * W_subband_so / W_byte = 12 * 12 * 2 * 8/8 = 288 octets. The packet overhead is
then (40 + 288) / 288 = 1.14, so 14 % overhead. The packet size is 40 + 288 = 328 octets and at the end node - one fixed selection
there are N-1 packets on the ring. This yields a aggregate 'packet size' of (16-1)*328 = 4920 octets. The - subset selection
load on the last link in the ring is: (N-1) * packet size * W_byte * f_sub * R_os = - completely arbitrary selection
(16-1) * 328 * 8 * 195312.5 * 1.25 ~= 9.61 Gbps. Note that the packet overhead of 14 % is larger than the
maximum estimated allowable packet overhead of 10 % to transport 12 subbands. The reason that 12 subbands The disadvantage of the fixed selection is that it rules out half of the LBA. The disadvantage
still fit is that the last node in the ring does not have to transport its local subbands via the ring, of the arbitrary selection is the book keeping within Station and TM. The N = 48 antennas can
so the use capacity on the last link is a factor (N-1)/N less. At the output node the data from all N node be selected from the N_lba = 96 antennas at different stages within SDP:
is combined into one payload of size N * 288 = 4608 octets, so the output load is ~= 9.18 Gbps (identical
to scheme 1, because the output rate does not depend on which ring scheme was used). - Transport all SI to the end node and select there. The advantage is that an arbitrary selection
can be done at the end node.
Both scheme 1 and scheme 2a can send offload 12 subbands per 10GbE link. The difference is that scheme 1 has . With transport scheme 2a the selection of N from N_lba will be done at the end node, so all
a load of ~9.08 Gbps on all hops, whereas for scheme 2a the load increases wit every hop and has a maximum PN then send all their S_pn inputs via the ring. The disadvantage is that it doubles the
of ~9.61 Gbps on the last hop. With scheme 1 each node has to put its load on the ring and requires a selection at the offload node.
local subbands at the right location in the packet. In this way the end node only needs to output the
payload, because the data is already in the subband offload payload format. With scheme 2a all nodes just - Transport only the selected SI per PN to the end node. For arbitrary selection this complicates
send their local data and pass on the transit data. At the end node a dispatcher and BSN aligner are needed the control per node, because different number of SI may be selected per node. A compromise can
to align the packets from all N = 16 nodes. After that the end node needs to reorder the data from these be to only support selecting all S_pn inputs for a node or none, or to only support select the
N = 16 input payloads into the subband offload payload format. This functionality in the end node is similar same S_pn/2 inputs for each node.
to the rsp_terminal function on UniBoard1 for AARTFAAC. Scheme is specific to the ring, scheme 2a would also . With transport scheme 1 the payload is passed along and each node can insert none, all or a
work if the subband data is send to the end node via a switch (or via URI like with RSP). subset of its S_pn at the allocated subband index in the payload. The payload size is fixed,
because it contains S signal inputs.
With scheme 2a the ring could be used in both directions, but this does not improve the capacity of the . With transport scheme 2a each node only sends the selected inputs. For arbitrary selection
ring. With scheme 2a in one direction the packets travel 1+2+3+...+(16-1) = 120 hops. With scheme 2a in this yields payload sizes that depend on the selection, which is awkward.
both directions the packets travel
1+2+3+4+5+6+7+8 = 36 hops left and 1+2+3+4+5+6+7 = 28 hops right, so total 64 hops. For the transport load The advantage of scheme 1 is that the output payload is already formed by the selection at each
on the ring as a whole scheme 2 is a factor 102/64 = 1.875 more efficient. However at the end node both node. With scheme 2a a multiplexer is needed to combine the payloads from all nodes into the
schemes still have transfer the same load of 15 packets. Therefore at the end node the load for both output packet. If in scheme 1 a packet gets lost, then all subbands from the remote nodes that
schemes is the same. Hence with 15+15 or (8+7)+(7+8) packets arriving at the end node, this node has no were already passed are lost. If in scheme 2a a packet gets lost, then only the subbands from the
spare capacity left to receive more subband packets via these two links. node that sent that packet are lost.
Using the ring in both directions does reduce the latency and therefor the input buffering at the end node
by a factor 1.875. Furthermore less hops also proportionally reduces the packet error rate. It is easier Design decision:
to use the ring in only one direction, because all nodes then send in the same direction, independent of - Assume the SO only transports the selected subbands and uses scheme 2a. The selection is made
their location in the ring. by letting each node either send all S_pn = 12 inputs or none. Hence only N/2 = 8 nodes send
subbands, the other N/2 nodes remain quiet. The selected nodes are identified via the
At the output node the packet payload is put in an UDP/IP packet and with an SDO application header. The UDP/IP channel field, e.g. if node 0, 3, 4, 5, 6, 7, 8, 11 are selected for output, then they get
header has 8+20 = 28 octets. The SDO header in LOFAR 1.0 has 22 octets. The output packet size is 40 + 28 + 22 channel index 0:7 via M&C and the other nodes do not send subbands. The channel index
+ 4608 = 4698 octets and the output data rate is packet size * W_byte * f_sub * R_os = 4698 * 8 * 195312.5 * determines the order of the subbands in the output packet. In this way:
1.25 ~= 9.18 Gbps. The output load is independent of the ring scheme. The ring has 12 full duplex 10GbE links. . the ring only transports selected subbands,
Suppose 8 of these can be allocated to subband offload, then the ring can suppport subband offload for maximum . subbands can be selected from different antennas, but only in groups of S_pn = 12 per PN,
2*8 * 12 = 192 subbands (= 37.5 MHz). This then requires 2*8 / 4 = 4 QSFP ports, on different nodes. . the antenna allocation per PN must suit the required SI selections.
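The node selection via the channel field can be illustrated with a small sketch. It assumes that the channel indices are handed out in ascending node order, which the text does not state explicitly, and the function names are hypothetical. In the design the indices are set via M&C, not computed in firmware.

```python
S_PN = 12  # signal inputs per PN, from the text


def assign_channels(selected_nodes):
    """Map each selected node to a channel index (assumed: ascending node order)."""
    return {node: channel for channel, node in enumerate(sorted(selected_nodes))}


def output_order(channel_map):
    """Return the nodes in the order their subbands appear in the output packet."""
    return [node for node, _ in sorted(channel_map.items(), key=lambda kv: kv[1])]


# Example selection from the text: 8 out of N = 16 nodes send subbands.
channel_map = assign_channels([0, 3, 4, 5, 6, 7, 8, 11])
selected_inputs = len(channel_map) * S_PN

for node in output_order(channel_map):
    print(f"node {node:2d} -> channel {channel_map[node]}")
print(f"selected signal inputs: {selected_inputs}")
```

Nodes that are absent from the map do not send subbands, so the ring only carries the selected data.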
For LOFAR 2.0 the number of LBA doubles to S_lba = 192, but AARTFAAC2.0 assumes that still S = 96
will offload subbands. Assume that only the selected S = 96 signal inputs are transported via the
ring using scheme 2a. For the LBA the ring in SDP connects N = S_lba / S_pn = 192 / 12 = 16 nodes,
so N-1 hops. Assume the subbands are sent in one direction along the ring. The subband data load
on the last hop is then (N-1)/N * 19.2G = 15/16 * 19.2G = 18.0 Gbps, excluding packet overhead.
Given a lane load capacity of L_lane = 7.8125 Gbps, this implies that the subband offload requires
at least ceil(18.0 / 7.8125) = ceil(2.3) = 3 lanes. Assume that 3 lanes will be used to transport
the S_sub_so = 64 subbands for AARTFAAC2.0. Choose S_sub_lane >= 22 subbands per lane. At the end
node the selection from 3*22 = 66 to 64 subbands will be made.
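The lane count for the ring follows from a short calculation, sketched here with the values quoted above. The variable names are illustrative only.

```python
import math

N = 16                   # nodes in the LBA ring
OFFLOAD_BPS = 19.2e9     # offload load for S = 96, excluding packet overhead
L_LANE = 7.8125e9        # lane load capacity in bit/s, as stated in the text
S_SUB_SO = 64            # subbands to offload for AARTFAAC2.0

# The end node does not send its own subbands over the ring.
last_hop_bps = (N - 1) / N * OFFLOAD_BPS
lanes = math.ceil(last_hop_bps / L_LANE)
s_sub_lane = math.ceil(S_SUB_SO / lanes)
spare_subbands = lanes * s_sub_lane - S_SUB_SO

print(f"load on last hop  : {last_hop_bps / 1e9:.1f} Gbps")
print(f"lanes needed      : {lanes}")
print(f"subbands per lane : {s_sub_lane}")
print(f"spare subbands    : {spare_subbands}")
```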
If the subband data is transported in one packet using scheme 1, then the payload can contain all
96 signal inputs. The payload size for S_sub_lane = 22 subbands then becomes P_payload =
S * S_sub_lane * N_complex * W_subband_so / W_byte = 96 * 22 * 2 * 8/8 = 4224 octets. The packet
overhead is P_overhead = 60, so (60 + 4224) / 4224 = 1.014, so 1.4% overhead. Each node then
inserts its local subbands at the appropriate offset in the payload. The packet size is P_packet =
60 + 4224 = 4284 octets and the data rate is f_sub, so the load on all links in the ring is
P_packet * W_byte * f_sub * R_os = 4284 * 8 * 195312.5 * 1.28 = 8.568 Gbps. Hence transporting
S_sub_lane = 22 subbands for S = 96 SI fits on a 10GbE lane.
If the subband data is sent in separate packets for each PN using scheme 2a, then the payload size
for S_sub_lane = 24 subbands per lane, S_pn = 12 signal inputs per PN and W_subband_so = 8 bit
becomes S_pn * S_sub_lane * N_complex * W_subband_so / W_byte = 12 * 24 * 2 * 8/8 = 576 octets.
The packet size is P_packet = 60 + 576 = 636 octets. The packet overhead is P_packet / P_payload
= 636 / 576 = 1.10, so 10% overhead. At the end node there are N/2-1 packets on the ring if the
end node is selected for offload and N/2 packets if the end node does not contribute SI for offload.
Assume worst case N/2 packets on the last link. The load on the last link in the ring is then
N/2 * P_packet * W_byte * f_sub * R_os = 16/2 * 636 * 8 * 195312.5 * 1.28 ~= 10.176 Gbps, which
does not fit on a 10GbE lane. Choosing
instead S_sub_lane = 22 subbands yields P_payload = 12 * 22 * 2 * 8/8 = 528, P_packet = 60 + 528
= 588 and an aggregate load of 16/2 * 588 * 8 * 195312.5 * 1.28 = 9.408 Gbps, which fits on a
10GbE lane.
Both scheme 1 and scheme 2a can transport S_sub_lane = 22 subbands per 10GbE lane. The difference
is that scheme 1 has a load of 8.568 Gbps on all hops, whereas for scheme 2a the load increases
with every hop and has a maximum of 9.408 Gbps on the last hop. With scheme 1 each node has to
put its local subbands at the right location in the packet. In this way the end node only needs
to output the payload, because the data is already in the subband offload payload format. With
scheme 2a all nodes just send their local data and pass on the transit data. At the end node a
demultiplexer and BSN aligner are needed to align the packets from all N/2 = 8 nodes. After that
the end node needs to reorder the data from these N/2 = 8 input payloads into the subband offload
payload format. This functionality in the end node is similar to the rsp_terminal function on
UniBoard1 for AARTFAAC. Scheme 1 is specific to the ring, scheme 2a would also work if the
subband data is sent to the end node via a switch (or via URI like with RSP).
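The link loads of scheme 1 and scheme 2a can be recomputed from the payload formula and P_overhead = 60 octets. The sketch below does this for the cases discussed above. The function names are illustrative, and comparing against a raw 10 Gbps line rate is an assumption made here for the "fits on a 10GbE lane" check.

```python
F_SUB = 195312.5      # subband sample rate in Hz
R_OS = 1.28           # oversampling factor
N_COMPLEX = 2         # real and imaginary part
W_SUBBAND_SO = 8      # bits per subband sample part
W_BYTE = 8            # bits per octet
P_OVERHEAD = 60       # packet overhead on the ring in octets, from the text
LINE_RATE = 10e9      # assumed raw 10GbE line rate in bit/s


def payload_octets(n_si, s_sub_lane):
    """Payload size for n_si signal inputs and s_sub_lane subbands."""
    return n_si * s_sub_lane * N_COMPLEX * W_SUBBAND_SO // W_BYTE


def link_load_bps(n_packets, p_payload):
    """Load on one ring link that carries n_packets per subband period."""
    return n_packets * (P_OVERHEAD + p_payload) * W_BYTE * F_SUB * R_OS


# Scheme 1: one packet with all S = 96 inputs on every hop.
scheme1 = link_load_bps(1, payload_octets(96, 22))
# Scheme 2a: worst case N/2 = 8 packets with S_pn = 12 inputs on the last hop.
scheme2a_24 = link_load_bps(8, payload_octets(12, 24))
scheme2a_22 = link_load_bps(8, payload_octets(12, 22))

for name, load in [("scheme 1, 22 subbands", scheme1),
                   ("scheme 2a, 24 subbands", scheme2a_24),
                   ("scheme 2a, 22 subbands", scheme2a_22)]:
    verdict = "fits" if load < LINE_RATE else "does not fit"
    print(f"{name}: {load / 1e9:.3f} Gbps, {verdict}")
```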
With scheme 2a the ring could be used in both directions, but this does not improve the capacity
of the ring. With scheme 2a in one direction the packets travel 1+2+3+...+(16-1) = 120 hops.
With scheme 2a in both directions the packets travel 1+2+3+4+5+6+7+8 = 36 hops left and
1+2+3+4+5+6+7 = 28 hops right, so total 64 hops. For the transport load on the ring as a whole
using both directions is a factor 120/64 = 1.875 more efficient. However at the end node the
ring still has to transfer the same load of N/2 = 8 packets. Therefore at the end node the
total load from both directions is the same. Hence with 8 packets arriving from one direction or
2 * 4 packets arriving from two directions, the end node has no spare capacity left to receive
more subband packets via these two links. Using the ring in both directions does reduce the
latency and therefore the input buffering at the end node by a factor 1.875. Furthermore fewer
hops also proportionally reduce the packet error rate. It is easier to use the ring in only
one direction, because all nodes then send in the same direction, independent of their location
in the ring. Design decision: use the ring only in one direction.
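The hop counts for one and two directions can be verified with the sketch below. It assumes, as in the text, that all N - 1 remote nodes send one packet to the end node and that with two directions the remote nodes are split as evenly as possible. The function names are illustrative.

```python
def hops_one_direction(n):
    """Total hops when all n - 1 remote nodes send the same way round the ring."""
    return sum(range(1, n))


def hops_two_directions(n):
    """Total hops when the remote nodes are split over both ring directions."""
    left_nodes = n // 2                  # 8 nodes for n = 16
    right_nodes = (n - 1) - left_nodes   # 7 nodes for n = 16
    left = sum(range(1, left_nodes + 1))
    right = sum(range(1, right_nodes + 1))
    return left, right


N = 16
one_way = hops_one_direction(N)
left, right = hops_two_directions(N)
gain = one_way / (left + right)

print(f"one direction  : {one_way} hops")
print(f"two directions : {left} + {right} = {left + right} hops")
print(f"efficiency gain: {gain}")
```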
At the output node the packet payload is put in a UDP/IP packet with an SDO application
header. The Ethernet overhead is 40 octets, the UDP/IP header has 8+20 = 28 octets and the SDO
header in LOFAR 1.0 has 22 octets. Hence the packet overhead is P_overhead = 40 + 28 + 22 = 90
octets. With 4 offload lanes and 16 subbands per lane, the packet size is P_packet = 90 + 3072
= 3162 octets. The offload payload data rate is P_payload * W_byte * f_sub * R_os = 3072 * 8 *
195312.5 * 1.28 ~= 6.144 Gbps per lane, or ~= 6.324 Gbps per lane including the packet overhead.
The output load is independent of the ring scheme. The ring has 12 full duplex 10GbE links.
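The output packet size and the load per offload lane can be recomputed as follows. This is a sketch with the header sizes quoted above, and the variable names are illustrative. It reports the load both for the payload alone and for the complete packet, because the two differ by the 90 octets of overhead.

```python
F_SUB = 195312.5   # subband sample rate in Hz
R_OS = 1.28        # oversampling factor
W_BYTE = 8         # bits per octet

ETH_OVERHEAD = 40      # octets, from the text
UDP_IP_HEADER = 8 + 20  # octets
SDO_HEADER = 22        # octets, LOFAR 1.0 value quoted in the text

p_overhead = ETH_OVERHEAD + UDP_IP_HEADER + SDO_HEADER
p_payload = 16 * 96 * 2 * 8 // W_BYTE   # 16 subbands per lane, S = 96, 8 bit complex
p_packet = p_overhead + p_payload

payload_rate_bps = p_payload * W_BYTE * F_SUB * R_OS
packet_rate_bps = p_packet * W_BYTE * F_SUB * R_OS

print(f"P_overhead   : {p_overhead} octets")
print(f"P_packet     : {p_packet} octets")
print(f"payload load : {payload_rate_bps / 1e9:.3f} Gbps per lane")
print(f"packet load  : {packet_rate_bps / 1e9:.3f} Gbps per lane")
```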
Design decision:
- Gather subbands at output node (instead of having a dedicated offload port at each node)
- Gather the subbands via the ring (to avoid the need for a 10GbE switch with about 40 ports)
- Reorder the subbands to have all subbands from signal inputs in one payload (to ease input stage of user application) - Reorder the subbands to have all subbands from signal inputs in one payload (to ease input
- Use scheme 2a and in both directions (to reduce the number of hops and latency) stage of user application)
- Select subbands in groups of S_pn SI per node from N/2 = 8 nodes to have S = 96 signal
inputs. The SI from the other nodes are then not used and not transported.
- Use ring transport scheme 2a and in one direction (simpler control than using both
directions)
- On the ring 3 lanes are sufficient to transport 22 subbands per lane. At the output 3 lanes
are sufficient, but use 4 lanes and output 16 subbands per lane, to reduce the load per
lane.
==> S_sub_so = 64 subbands, with W_subband_so = 8 bit, for N = 48 antennas can be offloaded
using 4 lanes on one QSFP port, with ~ 6.144 Gbps per lane in case of R_os = 1.28.
*******************************************************************************
* Transient buffer readout
*******************************************************************************
The transient buffer stores the data in frames of 2 kByte. A frame contains data from one signal input. The The transient buffer stores the data in frames of 2 kByte. A frame contains data from one signal
memory is divided into pages and each page can contain one frame. The transient buffer readout is controlled input. The memory is divided into pages and each page can contain one frame. The transient buffer
per signal input and defined by a start time and a number of pages. The start time translates into a start readout is controlled per signal input and defined by a start time and a number of pages. The
page. The SCU issues the read commands per signal input. The SDP firmware then reads and outputs the start time translates into a start page. The SCU issues the read commands per signal input. The
requested frames to CEP. When the transfer has finished, then the SDP firmware sends an event message to the SDP firmware then reads and outputs the requested frames to CEP. The SDP firmware keeps a count
SCU, and then the SCU issues a read command for the next signal input, until all signal inputs have been of the number of frames that have been output and that still need to be output. The SCU can poll
handled. For the ring the read out per signal input implies that at any time only one node will send data. these counts or wait on an event message from the SDP that signals that all frames have been
sent. When the transfer has finished, then the SCU issues a read command for the next signal
The read frames are encoded into an DP/ETH frame. The first frame that is read is encoded with a sync and the input, until all signal inputs have been handled. For the ring the read out per signal input
subsequent frames that are read can be counted via the BSN field. In this way a BSN monitor at the end node implies that at any time only one node will send data.
can monitor whether all frames for a signal input read out have arrived at the end node. The end node decodes
the frame and then encodes them into and UDP/IP/ETH frame to CEP. The transit nodes pass on the frames, and The read frames are encoded into a DP/ETH frame. The first frame that is read is encoded with a
also decode the frames to be able to monitor them with a BSN monitor. After each read command has finished the sync and the subsequent frames that are read can be counted via the BSN field. In this way a BSN
SCU can check the BSN monitor at the end node to know whether all frames arrived correctly at the end node. monitor at the end node can monitor whether all frames for a signal input read out have arrived
at the end node. The end node decodes the frames and then encodes them into a UDP/IP/ETH frame
For 1 Gbps data rate to CEP and packets of about 2 kByte the packet rate is 1e9/ 2000 / 8 = 62500 packets/s to CEP. The transit nodes pass on the frames, and also decode the frames to be able to monitor
or about one packet every 16 us, so about every 3 T_sub. It is allowed to let multiple nodes output TB data, them with a BSN monitor. After each read command has finished the SCU can check the BSN monitor
but the total number of packets/s has to still fit the output link. at the end node to know whether all frames arrived correctly at the end node.
For 1 Gbps data rate to CEP and packets of about 2 kByte the packet rate is 1e9/ 2000 / 8 = 62500
packets/s or about one packet every 16 us, so about every 3 T_sub. It is allowed to let multiple
nodes output TB data, but the total number of packets/s has to still fit the output link. The
readout node provides a programmable inter packet delay to throttle the output rate. The end node
immediately outputs the packets when they arrive.
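The packet rate estimate for the transient buffer readout can be reproduced with the sketch below. The function name is illustrative. The resulting packet interval is also the inter packet delay that the readout node would need to throttle the output to the given data rate, under the assumption that the delay is measured from packet start to packet start.

```python
F_SUB = 195312.5   # subband sample rate in Hz


def packet_rate(link_bps, packet_octets):
    """Packets per second for a given data rate and packet size."""
    return link_bps / (packet_octets * 8)


rate = packet_rate(1e9, 2000)          # 1 Gbps to CEP, packets of about 2 kByte
interval_us = 1e6 / rate               # time between packet starts
t_sub_us = 1e6 / F_SUB                 # subband period
interval_in_t_sub = interval_us / t_sub_us

print(f"packet rate     : {rate:.0f} packets/s")
print(f"packet interval : {interval_us:.1f} us")
print(f"T_sub           : {t_sub_us:.2f} us")
print(f"interval / T_sub: {interval_in_t_sub:.3f}")
```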
......
@@ -227,6 +227,9 @@ Open issues:
- Write RadioHDL article
- Write HDL RL=0 article - desp_hdl_design_article.txt
- XST : SNR = 1 per visibility for 10000 samples, brightest source log 19.5 --> 4.5 dB --> T_int = 1 s is ok.
- BSP registers:
. duration of operations : counts time since last power cycle (passive heartbeat)
. cause of reboot (power cycle, overtemperature, ...)