Commit deba7121 authored by Eric Kooistra

Updated BF and XC part of station2_sdp_ring.txt.

parent b8cb8ce9
...@@ -44,11 +44,14 @@ M&C:
The (x+y) could be implemented as first (x+y) and then *w, or as first weight and then add.
*******************************************************************************
* Subband correlator
*******************************************************************************
First the local crosslets are correlated with themselves and then
the local crosslets are kept in a barrel shifter, such that they can also be correlated with the
remote crosslets that arrive in the packets.
*******************************************************************************
...
...@@ -44,7 +44,7 @@
- RSP RAD frame:
  . uses: FSI, FSN, DATA, CRC.
  . The FSN is 16 bit but the MSbit is used for the sync. The other 15 bits count blocks.
  . After Rx frame the FSI is stripped and the CRC is replaced by a boolean check (BRC).
- CRC Error checking:
  The CRC is a 32 bit number, so the chance that the CRC results in a false positive is 1/2**32 ~= 2.3e-10 or 1
...@@ -123,7 +123,7 @@ to ensure that all inputs have the same 64 bit sync and BSN.
*******************************************************************************
* BSN aligner dp_bsn_align_v2
*******************************************************************************
Assumptions:
...@@ -441,6 +441,12 @@ Design options:
- flush per packet or flush until empty?
- flush per input or flush all inputs?
- flush by reading, or by reset or by moving a Rd pointer
  A FIFO can be flushed by resetting it, but this requires careful control to ensure that the reset is
  noticed in both clock domains, and that the reset is applied in between input packets to avoid that
  only a tail of a packet gets into a FIFO. Therefore in LOFAR 1.0 and APERTIF a FIFO is flushed by
  reading the packets from it until it is empty. This scheme also allows flushing per packet. The
  disadvantage of reading the packets and then discarding them is that it takes as long as reading at
  full speed.
- Use packet count instead of FIFO full indicator
- can we do without flushing the FIFO? Not if we need to realign.
- If multiple packets on a remote input get lost, then the other inputs fill up if there is no timeout. Flush
...
...@@ -2,60 +2,118 @@ Detailed design: RING
*******************************************************************************
* Data rate
*******************************************************************************
Support for oversampled subband filterbank
The oversampling increases the processing rate and data rate by a factor R_os. Typical R_os are
32/28 = 1.142, 32/27 = 1.185, 32/26 = 1.231, 32/25 = 1.28, 32/24 = 1.333. Assume R_os <= 1.28.

Processing capacity per subband period:
Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled
subbands it will run at R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz.
This means that the processing has N_fft = 1024 clock cycles available per subband period T_sub,
independent of R_os. In this way if the processing for the critically sampled subbands fits
within N_clk = N_fft = 1024 clock cycles, then it will also fit for the oversampled subbands.

IO capacity per 10GbE lane:
The IO data rate on the ring increases with the oversampling factor R_os. For oversampled data
the ring 10GbE has the full 10 Gbps capacity and for critically sampled data the effective
ring capacity per lane becomes L_lane = 10G / R_os = 10G / 1.28 = 7.8125 Gbps. The aim is to
be able to replace the critically sampled filterbank by an oversampled filterbank without
having to change other parts in the design. Therefore assume that the ring capacity for the
critically sampled data is restricted to L_lane < 7.8125 Gbps.
Note:
The alternative to use full ring capacity for critically sampled data and then support fewer
(S_sub_bf / R_os = 488 / 1.28 = 381, so about 22 % less) beamlets for oversampled data is not
compliant with the requirement of S_sub_bf = 488.

Design decision: Support S_sub_bf = 488 also for maximum R_os = 1.28.
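As a cross-check of the capacity numbers above, a minimal Python sketch (assuming the 200 MHz
reference clock and 10GbE lanes stated in this section; variable names are illustrative only):

    # Effective lane capacity and processing clock for a given oversampling factor.
    R_os = 32 / 25                 # = 1.28, maximum assumed oversampling factor
    f_clk_proc = 200e6 * R_os      # >= 256 MHz processing clock for R_os = 1.28
    L_lane = 10e9 / R_os           # 7.8125 Gbps effective lane capacity for
                                   # critically sampled data on one 10GbE lane
    N_clk_per_sub = 1024           # N_fft cycles per subband period, independent of R_os
    print(f_clk_proc / 1e6, L_lane / 1e9)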
W_beamlet_sum
LOFAR 1.0 had 24 bit for 16 bit beamlet mode and 12 bit for 8 bit beamlet mode. LOFAR 2.0 will
only support 8 bit. Using W_beamlet_sum = 18 bit provides 5 bits more dynamic range for 8 bit
beamlet mode, which is sufficient to detect overflow. Using W_beamlet_sum = 18 bit also fits the
input data width of the FPGA hard core multipliers in the BST. Given that the SDP signal input
level is 4 bit the beamformer could round 2 LSbit to effectively achieve 20 bit dynamic range,
even for S = 1 signal input. However the same effect can also be achieved by reducing the beamlet
weights by a factor 2**2 = 4. Choose the same W_beamlet_sum = 18 bit for both the critically
sampled beamlet data and the oversampled beamlet data, to avoid differences in the design. The
beamlet sum that is transported across the ring needs to fit on a 10GbE lane. With S_sub_bf = 488
the data rate for one full band station beam is N_pol * S_sub_bf * f_sub * N_complex *
W_beamlet_sum = 2 * 488 * 195312.5 * 2 * 18 = 6.8625 Gbps. Using L_lane = 7.8125 Gbps this leaves
about 1 - 6.8625 / 7.8125 = 12 % margin for packet overhead, which is sufficient.

Design decision:
Use W_beamlet_sum = 18 bit for both critically sampled beamlets and oversampled beamlets.
Using W_beamlet_sum = 18 bit fits on one 10GbE lane on the ring, fits the input data width
of the FPGA hard core multipliers and also provides sufficient dynamic range to scale the final
beamlet sum to W_beamlet = 8 bit for output.
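The beamlet sum rate and lane margin can be verified with a short Python sketch (parameter values
as quoted above; this is a back-of-the-envelope check, not firmware code):

    # Beamlet sum data rate for one full band station beam and lane margin.
    N_pol, S_sub_bf, N_complex, W_beamlet_sum = 2, 488, 2, 18
    f_sub = 195312.5               # subband rate in Hz
    L_lane = 7.8125e9              # effective 10GbE lane capacity in bps
    rate = N_pol * S_sub_bf * f_sub * N_complex * W_beamlet_sum   # 6.8625 Gbps
    margin = 1 - rate / L_lane                                    # ~12 % for packet overhead
    print(rate / 1e9, margin)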
*******************************************************************************
* Ring links:
*******************************************************************************
OSI 1 Physical layer: Transceivers

OSI 2 Data link layer:
Use Ethernet per transceiver link. The Ethernet MAC provides link establishment, so it uses a full
duplex transceiver. The Ethernet packet header contains destination MAC address, source MAC
address and Ethernet type. The Ethernet packet tail contains a CRC. The CRC provides data error
detection. There is no need to use UDP/IP and ARP, because the links in the ring are point to
point and will not be used in a network. The Ethernet fields can be used as:
- Destination MAC = destination PN index
- Source MAC = source PN index
- Ethernet type = packet type

Design decision: Use Ethernet for the ring transceiver links
Use 10GbE or 40GbE:
From the low-latency Ethernet core user guides it follows that the Ethernet core with statistics
registers use:
- 10GbE core : 4300 FF, 4 M9K
- 40GbE core : 21200 FF, 13 M20K
The synthesis fitter results from Apertif BF and XC show that the tech_eth_10g takes about 5500 FF
and 4 (BF) or 7 (XC) M9K. The BF MAC has no statistics, the XC MAC does have statistics.
Hence the 40GbE core is about a factor 4 larger than the 10GbE core, so from a resource usage point
of view it does not matter whether we use 4 x 10GbE or 1 x 40GbE. The advantage of 40GbE is
that it can fit data rates > 10Gbps per data type stream. The advantage of using 10GbE is that we
can use one link per data type stream and thereby avoid having to multiplex different data streams
onto the same 40GbE link. However some multiplexing of local packets and remote transit packets can
also be needed. UniBoard2 has been tested with 10GbE but not yet with 40GbE.
The Arria10 on UniBoard2 has 1708800 FF so 1708800 / 182400 = 9.3 times more than the Stratix IV
on UniBoard1. On UniBoard2 one 10GbE interface uses maximum about 5500 / 1708800 = 0.32 % of the
FF and maximum about 7 / 2713 = 0.25% of the block RAM. In total there will be 4 x 10GbE for the
intra board ring, 4 x 10GbE for the inter board ring and 1 x 10GbE for external IO, so these will
take about 3% of the FF and block RAM resources.
The packet rate is f_sub = 195312.5 Hz. At 10GbE this means that the maximum packet size is
10e9/195312.5 = 6400 octets. For oversampled subbands the maximum packet size drops to about
6400 / 1.25 = 5120 octets. If the minimum packet size is e.g. 4000 octets, then at 10GbE this
means that the link cannot be fully used, whereas at 40GbE multiple packets will still fit. The
maximum packet size for 10GbE also depends on the number of packets on the ring:
. With one packet on the ring the maximum packet size for 10GbE is 5120 (R_os = 1.25) octets,
. With N = 16 nodes and all nodes sending to the same end node the maximum packet size is 5120/16
= 320 (R_os = 1.25) octets,
. If the packet only needs to travel N/2 nodes then the maximum packet size is 5120/8 = 640
(R_os = 1.25) octets.
Design decision: Assume the ring will use 4 x 10GbE, because it is known technology and suitable.
Internally in the FPGA the 10GbE data on the ring interface is available as 64 bit data at
156.25 MHz (64 * 156.25M = 10G).
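The maximum packet sizes quoted above follow directly from the packet rate and lane capacity, as
in this small sketch (assumes R_os = 1.25 and N = 16 as in the examples; illustrative only):

    # Maximum packet size per 10GbE lane versus the number of packets on the ring.
    f_sub = 195312.5                                # packet rate in Hz
    octets_per_period = 10e9 / f_sub / 8            # 6400 octets per T_sub at 10GbE
    octets_oversampled = octets_per_period / 1.25   # ~5120 octets for R_os = 1.25
    for packets_on_ring in (1, 8, 16):              # 1 packet, N/2 hops, all N nodes to one end node
        print(packets_on_ring, octets_oversampled / packets_on_ring)   # 5120, 640, 320 octets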
Ring application Ethernet packet types:
The ring is used for the following application packet types:
- 0x10FB for beamlets,
...@@ -63,136 +121,235 @@ The ring is used for the following application packet types:
- 0x10FD for subband offload,
- 0x10FE for transient buffer read out
The packet type information can be transported via the Ethernet type field or via an UDP port
number. If each lane is only used for one kind of packet type, then the packet type is only used
for information, because the PN already knows the packet type. The packet type value is based
on packet types that were defined in RSP, where 0x10FA was used to identify M&C data (0x10FA ~=
LOFAR) and the other type values just increment the 0x10FA value.
Design decision: Transport application packet type via Ethernet type field for information
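For reference, the packet types listed above as a small lookup table (the RSP M&C type 0x10FA is
taken from the text; the crosslet type value is not shown in this excerpt and is omitted here):

    # Application packet type values carried in the Ethernet type field.
    RING_ETH_TYPES = {
        0x10FA: "M&C (RSP)",
        0x10FB: "beamlets",
        0x10FD: "subband offload",
        0x10FE: "transient buffer read out",
    }

    def packet_type_name(eth_type):
        return RING_ETH_TYPES.get(eth_type, "unknown")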
OSI 3 Network layer: Use ring

Wormhole routing (or cut-through routing) or store-and-forward routing:
With wormhole routing a received packet or a received and modified packet is already
transmitted, while the tail of the packet is still being received. The advantage of wormhole
routing is that it minimizes the latency along the ring and therefore also local buffering to
align between local and remote data. The disadvantage of wormhole routing is that a CRC error
on the received packet needs to be propagated by forcing the CRC of the transmitted packet to be
wrong. This implies that all subsequent hops will show this CRC error. For link diagnosis this
is confusing, because the subsequent links did not cause the CRC error. With store-and-forward
routing a packet is first received entirely before it is passed on for transmit. This allows
discarding a received packet with a CRC error, but does increase the latency on the ring. Packets
with a CRC error cannot be allowed to enter the processing in the node, because any bit in the
packet may be corrupted, especially in the packet header, so no meaningful processing is
possible.
Design decision:
For LOFAR 2.0 choose to use store-and-forward, because it allows discarding packets with CRC
errors when they occur and because there is sufficient internal block RAM to buffer the local
data for the worst case ring latency.

Only accept correct packets:
Discard all packets that have a CRC error. This also prevents that packets of wrong length enter
the internal processing. The Ethernet CRC is 32 bit, so it is very unlikely that a packet with
errors still has a correct CRC. With wormhole routing it was necessary to limit or extend a packet
to a known fixed length, because also packets with a CRC error are passed on. With store-and-forward
routing the CRC provides sufficient protection to ensure that only correct packets enter the
application.
Ring latency:
The latency of 1 hop is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at
10Gbps is about 1.2 us and a jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for
packets >~ 300 octets the ring latency is dominated by the store-and-forward routing at each node.
The 10GbE Ethernet MAC uses 64 bit data. At 200 MHz this can achieve 64 * 0.2 = 12.8 Gbps. Hence
if the processing operates without data valid gaps, then the Ethernet transmit will not run empty
during a payload. Therefore it is not necessary to use a fill FIFO, which would add to the ring
latency. For a packet that travels the entire ring the latency is then about (N-1) * T_sub and
the corresponding FIFO depth to align the local data with this remote data is (N-1) * packet size.
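A minimal sketch of the latency and align-FIFO estimate above (assuming N = 16 nodes, the ~0.2 us
hop latency and the 6400 octet jumbo frame mentioned in this section):

    # Store-and-forward ring latency and the FIFO depth to align local with remote data.
    N = 16
    packet_octets = 6400
    t_frame = packet_octets * 8 / 10e9                  # ~5.12 us, about one T_sub
    hop_latency = 0.2e-6                                # per-hop link latency (approximate)
    ring_latency = (N - 1) * (t_frame + hop_latency)    # ~80 us for a full traversal
    align_fifo_octets = (N - 1) * packet_octets         # (N-1) * packet size = 96000 octets
    print(ring_latency * 1e6, align_fifo_octets)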
OSI 4 Transport layer: Use UDP/IP/ETH or only ETH on the ring:
We already have a UDP offload component that supports DP/UDP/IP/ETH, but a similar component that
only supports DP/ETH is easily derived from it. With an UDP the LOFAR packet type information can
be transported via the UDP port field. Using UDP/IP makes it easier to send the data to a PC for
monitoring purposes, however it is also possible to sniff raw Ethernet packets on a PC. Using a
PC to verify the ring allows capturing large amounts of data. On an FPGA we can use a data buffer
to sniff the packets, but only a few. The extra overhead of UDP = 8 octets and IP = 20, so 28
octets in total. The disadvantage of using UDP/IP is that it adds some extra traffic overhead and
uses some extra logic resources, but that could be acceptable. The disadvantages of verifying the
ring using a PC are:
- between FPGAs on the same UniBoard the ring can only be observed on the FPGA
- the ring will only connect FPGAs in the application, so using a PC is a side track that as such
  may cause extra work.

Using UDP/IP does not make it possible to replace the ring by a switch without modifications, so
changing from a ring based design to a switch based design will still imply a redesign of the
data transport scheme.
Design decision:
Use raw ETH and verification on FPGA, because that fits the ring (especially between FPGAs on
UniBoard2) and avoids the extra overhead of UDP/IP.

Ring application header DP/ETH:
The packet payload needs to have an application header to carry the timestamp and a stream
identifier. This information can be transported via the DP packet header which has a BSN field and
a channel field. The BSN is the timestamp. The channel field can carry the source PN index and
destination PN index. These PN indices are also available in the ETH source and destination MAC
addresses of ETH encoded packets, but they also need to be available in ETH decoded packets. In
ETH encoded packets the destination MAC address allows direct pass on of transit packets on the
ring, without having to ETH decode them. In ETH decoded packets the BSN and channel fields can be
passed along inside the encoded DP packet or in parallel with the decoded DP packet application
data. The channel information can be used to process the remote packets in parallel e.g. per
source PN index. The channel information can also provide flagging information, to e.g. identify
filler packets.
Design decision:
Use DP/ETH. Together the DP CRC and ETH CRC ensure that for the lifetime of LOFAR2.0 packets
with a correct CRC will not be false positives. Use a bit in the channel field to indicate
filler packets.

What is the DP/ETH packet overhead?
- The ETH packet overhead consists of:
  . Add 8 octets (c_network_eth_preamble_len) for Ethernet preamble
  . Add 14 octets for the ETH header that contains destination MAC (6), source MAC (6) and
    Ethernet type (2)
  . Add 2 octets to pad the ETH header to align to 8 byte word boundary
  . Add 4 octets for CRC
  . Add 12 octets (c_network_eth_gap_len) for Ethernet gap size between packets
  = 8 + 14 + 2 + 4 + 12 = 40 octets
- The DP packet overhead consists of (dp_packet_enc_crc / dp_packet_dec_crc):
  . Add 4 octets for CHAN (32b)
  . Add 8 octets for Sync & BSN (64b)
  . Add 4 octets for ERR (32b)
  . Add 4 octets for CRC (32b)
  = 4 + 8 + 4 + 4 = 20 octets
Design decision: The DP/ETH packet overhead is P_overhead = 60 octets.
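The P_overhead = 60 octets can be tallied from the two field lists above, e.g. with this sketch
(field names follow the lists; values in octets):

    # DP/ETH packet overhead per packet.
    eth_overhead = {
        "preamble": 8,     # c_network_eth_preamble_len
        "eth_header": 14,  # destination MAC (6) + source MAC (6) + Ethernet type (2)
        "align_pad": 2,    # pad ETH header to an 8 byte word boundary
        "crc": 4,
        "gap": 12,         # c_network_eth_gap_len
    }
    dp_overhead = {
        "chan": 4,         # CHAN (32b)
        "sync_bsn": 8,     # Sync & BSN (64b)
        "err": 4,          # ERR (32b)
        "crc": 4,          # CRC (32b)
    }
    P_overhead = sum(eth_overhead.values()) + sum(dp_overhead.values())
    assert P_overhead == 40 + 20 == 60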
Use one packet type per ring lane:
This avoids having to multiplex different packet types onto a single lane. Still the Ethernet type
can be used to fill in the packet type to more easily identify data on different lanes of the ring.
How many transceivers are needed for the ring?
The ring uses 4 of the 12 available transceivers, to match the QSFP cable link that is needed to
connect the ring between UniBoard2.
There are four data types beamlets, crosslets, subband offload and transient buffer read out. The
data loads are:
- 488 beamlets (R_os = 1 --> W_beamlet_sum = 18 bit, R_os = 1.28)
- ~10 crosslets (R_os = 1 --> 15 crosslets, R_os = 1.25 --> 12 crosslets)
- subbands (R_os = 1
- << 10Gbps transient buffer data

Link monitoring:
The link should be monitored during normal operation, to avoid the need to define and control a
test packet (e.g. like ping). The link monitoring should directly identify the source of an error
(e.g. tx node, link, rx node).
Design decision: Use DP/ETH packets to monitor the link quality.
*******************************************************************************
* Ring usage:
*******************************************************************************
OSI 5 Session layer:
OSI 6 Presentation layer:
OSI 7 Application layer:
The ring function has the following sub functions:
- Receive packets from ring (and remove CRC field)
- Discard incorrect packets (based on CRC)
- Pass on transit packets (Destination MAC > PN index for forward ring, MAC < PN index for backward
  ring)
- Decode packets (get packet from ring for internal use)
- Encode packets (put internal packet onto ring)
- Multiplex local and transit packets
- Transmit packets onto ring
- Monitor Rx and Tx packets
- Align packets for processing (use filler data on inputs with lost packets)
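A minimal sketch of the per-node receive decision implied by the sub functions above (names are
illustrative, not the actual firmware interface; the forward ring passes on packets with a larger
destination PN index):

    def handle_rx_packet(dst_pn, my_pn, crc_ok, forward=True):
        if not crc_ok:
            return "discard"              # only accept correct packets
        if dst_pn == my_pn:
            return "decode"               # get packet from ring for internal use
        passes = dst_pn > my_pn if forward else dst_pn < my_pn
        return "transit" if passes else "discard"   # pass on, multiplexed with local Tx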
Ring access schemes:
- 1) start node sends packet to end node, intermediate nodes modify the packet.
- 2a) each node starts sending its packets to an end node, intermediate nodes pass on the packet
- 2b) each node starts sending its packets to an end node, intermediate nodes pass on the packet
and use the packet (= multi cast)
If both scheme 1 and 2 are suitable, then scheme 1 typically yields a larger payload, because it
reserves slots for all nodes, whereas the payload for scheme 2 only contains data from one node.
Scheme 1 and 2b are useful if the transit nodes also use or modify the packet data. The multiple
hops are then used to multi cast the data. Scheme 2a is suitable for packet transport from start
to end node, whereby transit nodes only pass on the packet.
For the beamformer beamlets scheme 1 is most suitable. The start node prepares the packet with
the initial beamlet sums. The subsequent nodes add their local beamlet sum to the packet
beamlet sums and then pass on the packet.
For the subband correlator both scheme 1 and scheme 2b are suitable. For scheme 1 the start node
creates a packet with slots for all nodes and fills in its own slot with its crosslets. Scheme 1
was used in LOFAR 1.0. The subsequent nodes fill in their slots with their crosslets and also
use the packets to correlate the remote crosslets with their local crosslets. With scheme 2b
each node creates a packet with its own crosslets and sends it to N/2 nodes further. The
intermediate nodes pass on or remove the packets and use the packets to correlate the remote
crosslets with their local crosslets. The disadvantage of scheme 1 is that it requires a
dedicated start node that initiates the aggregate packet. With scheme 2b each node acts as start
node for its own packet. Intermediate nodes use the remote packets for correlation and pass
them on. The final destination node removes the packet.
For the subband offload both scheme 1 and scheme 2a are suitable. For scheme 1 the start node
creates a packet with slots for all nodes and fills in its own slot with its subbands. The
subsequent nodes fill in their slots with their subbands. With scheme 2a each node creates a
packet with its own subbands and sends it to the output end node. The other nodes only pass on
the remote packets.
For transient buffer read out scheme 2a is most suitable to gather the read out data from each
node at the output end node.
Ring access directions:
The ring can be used in both directions. The forward direction is e.g. from PN0 to 15, the
backward direction is e.g. from PN 15 to 0 for N = 16 nodes.
All schemes can be used in two directions for the same type of data transport. In one direction
the maximum number of hops between start and end node is N-1, while by using both directions the
maximum number of hops between start and end node is N/2. If the data is used on all intermediate
nodes, then there is no advantage to use the ring in both directions. If the data is only passed
along by intermediate nodes, then the link capacity is used about a factor two more efficiently
by sending data in both directions. Disadvantages of using the ring in both directions for the
same type of data are that each node needs to decide which direction to use, that the data arrives
from both directions at the end node, and that it is somewhat more difficult to understand and
diagnose.
Design decision: Therefore choose to use the ring in only one direction per link.

Use one link per packet type:
For scheme 2 use only one link for all source nodes, so do not let different source nodes use
different links. For N/2 = 8 or N = 16 the number of links would become too large. By using one
link for all sources, increasing the processing becomes a matter of using and instantiating more
links.

Remote and local data alignment:
In APERTIF the data arrived from >= 2 remote streams. With the LOFAR ring there is always local
data that arrives first and needs to be aligned with only one remote data stream. The local data
needs to be buffered until the remote data from the farthest PN has arrived. The latency on the
ring is about 1 packet per transit hop, due to the store-and-forward. The first hop has negligible
latency. Hence with H hops the local data buffer size needs to be (H-1) * local data size. When
the remote data arrives the local data is popped from the buffer. If the remote data has not
arrived in time, then the local data is popped from the buffer when the next local data is pushed
into the buffer.

Ring data transport schemes:
- beamlets on ring: l --> r+l --> r+l --> ... --> r+l
...@@ -200,13 +357,10 @@ Ring data transport schemes:
  . output filler data if remote got lost, to preserve nominal output rate to CEP
- crosslets on ring: rrrrrrrr,l --> rrrrrrrr,l --> ... --> rrrrrrrr,l
  . on each node first align all inputs l,N/2*r, and then split into N/2 pairs of l,r to have one
    pair per XC cell (or on each node separately align N/2 pairs of inputs l,r, have one pair per
    XC cell). Output filler data if remote got lost, and use zero to not disturb the integration
    and count unflagged blocks to know the number of active blocks per integration sync interval.
- subbands on ring: l, rl, rrl, rrrl, ..., rrrrrrrrrrrrrrrl
  . on final node align all l,(N-1)*r inputs
...@@ -216,56 +370,6 @@ Ring data transport schemes:
  . no align, readout from one node at a time
*******************************************************************************
...@@ -273,139 +377,211 @@ arrived in time, then the local data is popped from the buffer when the next loc
*******************************************************************************
What is the beamlet packet size?
The beamlet sum is passed on along the ring from start PN to end PN using ring access scheme 1. At
the end PN the final beamlet sum is scaled to W_beamlet = 8 bit and output to CEP. The intermediate
beamlet sum has W_beamlet_sum = 18 bit and is complex. There are N_pol * S_sub_bf = 2 * 488 = 976
beamlets per packet. The payload size is N_pol * S_sub_bf * N_complex * W_beamlet_sum / W_byte =
2 * 488 * 2 * 18 / 8 = 4392 octets. The effective packet size is 60 + 4392 = 4452 octets. With
f_sub = 195312.5 Hz the data rate is 4452 * 195312.5 * 8 = 6.95625 Gbps < L_lane = 7.8125 Gbps, so
it fits on a 10GbE lane.
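The beamlet packet size and lane load above can be reproduced with this sketch (parameter values
as stated; P_overhead = 60 octets for DP/ETH):

    # Beamlet packet size and resulting lane load.
    N_pol, S_sub_bf, N_complex, W_beamlet_sum, W_byte = 2, 488, 2, 18, 8
    f_sub, P_overhead = 195312.5, 60
    payload = N_pol * S_sub_bf * N_complex * W_beamlet_sum // W_byte   # 4392 octets
    packet = P_overhead + payload                                      # 4452 octets
    rate = packet * f_sub * 8                                          # 6.95625 Gbps
    print(payload, packet, rate / 1e9)      # < L_lane = 7.8125 Gbps, fits on one lane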
Packet decoding and encoding:
The start node encodes the packet and the end node decodes the packet. The intermediate nodes could
operate on the encoded packet, however the payload beamlets are packed into bytes and are not word
aligned. Therefore the intermediate nodes also need to decode the packet to be able to update the
payload data, and then encode the packet again. The decode and encode function is available in any
node, because all nodes run the same firmware image. Therefore the decoding and encoding at
intermediate nodes can reuse the encoding function of the start node and the decode function of the
end node, so no extra logic is needed.

Ring adder payload processing:
The full band station beam has S_sub_bf = 488 beamlets per polarization, so in total there are
N_pol * S_sub_bf = 2 * 488 = 976 complex beamlets per subband period of N_fft = 1024 cycles @
200 MHz. For an oversampled filterbank with R_os > 1 the processing rate is increased to
200 * R_os MHz, so there are still N_fft = 1024 cycles available to process 976 beamlets. The ring
adder adds the local beamlet sum to the received beamlet sum and passes on the result. The beamlet
sum is received as a packet with 64 bit packed data at 156.25 MHz (64 * 156.25M = 10G). The 976
beamlets fit in 976 * 18b * 2 / 64b = 549 64b words. The packet is processed at 200 * R_os MHz.
. from 10GbE -->
. Rx packet 64b @ 156MHz --> Rx FIFO to dp_clk domain -->
. Rx packet 64b @ 200MHz --> DP/ETH decode to discard or extract payload of 549 words -->
. Rx payload 64b @ 200MHz --> repack 549 words to 976 beamlets -->
. Align remote and local beamlets -->
. Sum remote and local beamlets --> repack 976 beamlets to 549 words -->
. Tx payload 64b @ 200MHz --> DP/ETH encode to add header and tail -->
. Tx packet 64b @ 200MHz --> Tx FIFO to tx_clk domain -->
. Tx packet 64b @ 156MHz -->
. to 10GbE
? Does align belong to ring or to beamlet ring adder?
  --> to beamlet ring adder:
      - to avoid having an align input and output on the ring interface.
      - implies that align monitor also belongs to beamlet ring adder
? Does sum belong to ring or to beamlet ring adder or to local beamformer?
  --> to beamlet ring adder:
      - it deserves a dedicated block, because it is part of the BF (so not of the ring) and it only
        adds (so does not have BF weights like the local BF).
Local beamlet sums FIFO size:
The local subband data needs to be buffered until the beamlet sum arrives. The size of the buffer
is determined by the last node, because then the beamlet sum has travelled N-1 hops. For each hop
the packet is delayed by:
- packet encoding
- packet transport over the ring
After each hop the packet is delayed by:
- store-and-forward to be able to check the CRC
- packet decoding
- packet processing
The store-and-forward causes a latency of one block period (T_sub) per hop and is the dominant
factor in the latency. During this latency N-1 local blocks need to be buffered. Assume that the
processing and transport delays are shorter than one block period, so buffering one extra local
block is sufficient to compensate it. Per PN this yields a FIFO size of N_pol * S_sub_bf * N *
N_complex * W_subband = 2 * 488 * 16 * 2 * 18 = 562176 bit, which takes about 32 M20k block RAMs.
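The FIFO size estimate above in a short sketch (same parameter values; the raw M20K count comes
out a bit lower than the ~32 quoted, which includes RAM organization overhead):

    # Local beamlet sums FIFO size per PN.
    N_pol, S_sub_bf, N, N_complex, W_subband = 2, 488, 16, 2, 18
    fifo_bits = N_pol * S_sub_bf * N * N_complex * W_subband   # 562176 bit
    m20k_bits = 20 * 1024
    print(fifo_bits, fifo_bits / m20k_bits)    # 562176 bit, ~27.5 M20K raw (~32 in practice)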
Ring modes:
- off
- local
- remote
- combine
With dp_bsn_align_v2 all these modes are supported by enabling/disabling the corresponding inputs.

Lost remote packet detection:
Local FIFO full:
The local FIFO needs to buffer the local data to be able to align with the remote data. The latency between
nodes depends on the number of hops. With N = 16 nodes and store-and-forward packet transport the maximum
latency will be < N * T_sub. To compensate for this latency the local FIFO needs to be able to store at most
about N local packets. If the FIFO runs full, then this is an indicator that remote packets got lost and
then the local FIFO needs to be flushed until it is empty.
Rx timeout:
The average packet rate on the ring is f_sub, so within T_sub there should arrive a new packet. If no packet
arrives within T_sub, then the local FIFO can flush one packet. In this way the local FIFO does not need to
be flushed until empty and fewer packets will get lost once the remote packets arrive again. Using Rx timeout
relies on packets fitting within a T_sub interval and on every T_sub interval containing at least part
of a packet, so the actual packet rate must be close to the average packet rate.
Remote packets:
The remote packets drive the ring adder and are processed on arrival. The local packet with the same time stamp
is already pending in the local beamlets FIFO. If a burst of remote packets gets lost, then the node will
notice this because its local beamlets keep arriving and will overflow the local beamlets FIFO. The node will
read and discard packets from the local beamlets FIFO to make sure that the FIFO does not overflow. If only
one or a few remote packets got lost, then the node will notice this during the time stamp alignment, but
only as soon as the next packet has arrived. This next packet will be ahead of the local packet, so the local
packets need to be flushed. The node will then read and discard packets from the local beamlets FIFO until it
can align the remote and local data. During this realignment process the next remote packet may already arrive
as well. Therefore the remote packet needs to be buffered, or discarded. Assume the FIFO is flushed by reading
and then discarding packets from it. The local packets and the remote packets arrive at the same rate. The
flushing of the packets goes faster than reading them, because flushing can use all clock cycles. The flushing
can only catch up if the gaps between packets are large enough. Therefore in LOFAR 1.0 the remote packets were
discarded during the flushing. This does mean that when one packet gets lost, the flushing will also discard
the next packet and some more for as long as it takes to empty the local beamlets FIFO. An alternative would
be to keep on flushing and discarding remote packets, until the local beamlet FIFO is again ahead of the
remote packets. Typically packets will get lost rarely or in bursts. In both cases it is fine to just flush
the local beamlet FIFO until it is empty.
      PN0   PN1   PN2   PN3   PN4
  t
  0:  L0    L1    L2    L3    L4    <-- S_sub_bf = 488 beamlets (dual pol complex) per packet
      R4    R0    R1    R2    R3
      R3    R4    R0    R1    R2

The beamformer function has the following sub functions:
- "Beamlet subband select" : Select S_sub_bf = 488 subbands per signal input
- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for
  S_pn = 12 signal inputs
- "Beamlet ring adder" :
  if start node:
    - Encode local beamlet sums packet to ring
  else:
    - Buffer the local beamlet sums for ~= N subband intervals
    - Decode remote beamlet sums packet from ring
    - Align remote beamlet sums packet and local beamlet sums packet
    - Add local beamlet sums to remote beamlet sums packet
  if transit node:
    - Encode beamlet sums packet to ring
  else:
- "Beamlet data output" : On output node scale and output final beamlet sums
- "Beamlet statistics (BST)": Calculate BST for beamlet sums, output node has final BST
*******************************************************************************
* Subband Correlator
*******************************************************************************
Crosslet transport scheme:
With transport scheme 1 crosslets from different source nodes are combined into one packet.
Scheme 2b packs only local crosslets into a packet. Compared to scheme 1, scheme 2b:
- treats the local and remote crosslets independently
- has small payload and thus more packet overhead, but the load still fits on a lane
- has small payload that can be enlarged by transporting more local crosslets, to support
  a subband correlator with N_crosslets > 1.
Design decision:
Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops,
because it is more flexible to have only local crosslets per packet.
Number of square correlator cells per PN: Number of square correlator cells per PN:
With N = 16 PN for LBA there are N/2 = 8 remote crosslet packets. Hence together with the local crosslet visibilities There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2
this yields X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. remote crosslet packets. The local crosslets have to be correlated with the local crosslets and
with each of the remote crosslet packets. The correlation with the local crosslets is a square
matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square correlator
cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other local-remote
square correlator cells the efficiency is 100 %. With N = 16 PN for LBA there are N/2 = 8 remote
crosslet packets. Hence together with the local crosslet visibilities this yields
X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN.
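
These numbers can be verified with a few lines of Python (used here purely as a calculator for the
values quoted above):

  # Visibilities per correlator cell and per PN, and local-local cell efficiency.
  S_pn = 12                                # signal inputs (crosslets) per PN
  N    = 16                                # PN per ring for LBA
  X_sq = S_pn * S_pn                       # 144 visibilities per square correlator cell
  X_pn = (N // 2 + 1) * X_sq               # 1296 visibilities per PN
  eff  = (S_pn * (S_pn + 1)) / 2 / X_sq    # ~0.54 efficiency of the local-local cell
  print(X_sq, X_pn, round(eff, 2))         # prints: 144 1296 0.54
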
Number of multipliers per crosslet:
The subband correlator needs to finish within one subband period, so within N_fft = 1024 clock
cycles. The X_pn = 1296 visibilities per PN can be calculated using one complex multiplier if
the complex multiplier runs at > 1296 / 1024 * 200 MHz ~= 253 MHz. For an oversampled filterbank
with R_os <= 1.28 this requires 324 MHz, which is too much. All X_pn = 1296 can be calculated
using two complex multipliers running at > 324 / 2 = 162 MHz. However another option is to use one
multiplier per X_sq = 144 visibilities, so one complex multiplier per correlator cell and
N/2 + 1 = 9 correlator cells in parallel. The FPGA has sufficient multipliers to support this
scheme and the spare capacity of each correlator cell can be used to support a subband correlator
with more than 1 subband per integration interval, so N_crosslets > 1.
Design decision:
Use 1 + N/2 parallel correlator cells, for the local-local visibilities and for the local-remote
visibilities for each remote source.
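
The multiplier rate trade-off can be summarized as follows (Python used purely as a calculator;
the values follow the text above):

  # Required complex multiplier rates for the design options above.
  N_fft = 1024                   # clock cycles per critically sampled subband period
  f_clk = 200e6                  # DSP processing clock
  R_os  = 1.28                   # filterbank oversampling factor
  X_pn  = 1296                   # visibilities per PN
  X_sq  = 144                    # visibilities per correlator cell

  f_one = X_pn / N_fft * f_clk   # ~253 MHz for one multiplier, critically sampled
  f_os  = f_one * R_os           # ~324 MHz with oversampling, too fast
  f_two = f_os / 2               # ~162 MHz with two multipliers
  util  = X_sq / N_fft           # ~14 % multiplier use per cell for one crosslet
  print(round(f_one / 1e6), round(f_os / 1e6), round(f_two / 1e6), round(util, 2))
  # prints: 253 324 162 0.14
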
What is the crosslet packet size?
With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per
packet. A crosslet is a W_crosslet = 16 bit complex value, so the payload is 12 * 4 = 48 octets
and the effective packet size is P_packet = 60 + 48 = 108 octets. The relative packet overhead for
single crosslet payloads is P_overhead / P_packet = 60 / 108 ~= 56 %.
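
A quick check of the single crosslet packet size (Python used purely as a calculator; the
P_overhead = 60 octets is taken from the text above):

  # Single crosslet packet size and relative overhead.
  S_pn       = 12                # signal inputs per PN, one crosslet value each
  W_octets   = 4                 # 16 bit complex crosslet value = 4 octets
  P_overhead = 60                # packet overhead in octets

  payload  = S_pn * W_octets                 # 48 octets
  P_packet = P_overhead + payload            # 108 octets
  print(payload, P_packet, round(P_overhead / P_packet, 2))   # prints: 48 108 0.56
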
Maximum number of crosslets per lane:
There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields
a packet load of P_packet * 8b * f_sub * N/2 = (108 * 8b) * 195312.5 * 16/2 = 1.35 Gbps. The data
load of only the payload data is payload size * 8b * f_sub * N/2 = (48 * 8b) * 195312.5 * 16/2 =
0.6 Gbps. Hence the small packet size causes a large relative packet overhead, but the load is
still acceptable, since it is < L_lane = 7.8125 Gbps and thus fits on a single 10G lane of the ring.
Multiple local crosslets could be transported via separate packets; a lane can then fit about
7.8125 / 1.35 ~= 5 different crosslets. Packing the local crosslets into a single payload
reduces the packet overhead. The maximum number of crosslets per packet follows from
((P_overhead + X * 48) * 8b) * f_sub * N/2 < L_lane. For N = 16 this yields X ~=
(7.8125 Gbps / (8b * 195312.5 * 16/2) - 60) / 48 = (625 - 60) / 48 ~= 11.8, so about 12. With
12 crosslets the payload size is 12 * 48 = 576 octets and the effective packet size is
P_packet = 60 + 576 = 636 octets. The relative packet overhead for multi crosslet payloads is
P_overhead / P_packet = 60 / 636 ~= 9.4 %. The packet load for multi crosslet payloads is
(636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps > L_lane = 7.8125 Gbps, so 12 crosslets just do not fit
on a 10GbE lane, due to the still significant packet overhead. Using X = 11 instead of 12
crosslets per packet yields a total crosslet packet load per lane of
((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on a lane.
Design decision:
Pack local crosslets into a single payload if N_crosslets > 1, because then the packet overhead
is much reduced, to support transporting more crosslets per lane (11 instead of 5).
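
The lane load for X packed crosslets can be sketched as below (Python used purely as a calculator;
the helper function name is made up for this sketch):

  # Lane load for X crosslets packed into one payload, travelling N/2 hops.
  f_sub  = 195312.5              # packets per second per source
  hops   = 16 // 2               # N/2 hops for N = 16
  L_lane = 7.8125e9              # usable lane rate in bps

  def lane_load_bps(x, p_overhead=60, octets_per_crosslet=48):
      return (p_overhead + x * octets_per_crosslet) * 8 * f_sub * hops

  for x in (1, 11, 12):
      print(x, round(lane_load_bps(x) / 1e9, 2), lane_load_bps(x) < L_lane)
  # prints: 1 1.35 True / 11 7.35 True / 12 7.95 False
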
Maximum number of crosslets per correlator cell:
An X_pn correlator cell can correlate N_fft / X_sq = 1024 / 144 = 7 different crosslet frequencies.
With N = 16 for LBA, there need to be N/2 + 1 = 9 of these X_pn correlator cells in parallel. One
X_pn correlates the local-local crosslets and the other N/2 X_pn correlate the local-remote
crosslets. These 9 X_pn in parallel can correlate up to 7 crosslets. The lane can transport
maximum 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO capacity of 1
10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets. The crosslet data
rate on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a utilization of
4.95 / 7.8125 ~= 63 %. Another set of 9 X_pn could be used to correlate the remaining 11 - 7 = 4
crosslets that can be transported via the ring. However, if more than N_crosslets = 7 crosslets
need to be correlated in parallel per integration, then it is easier to allocate an extra lane and
to instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in total.
One X_pn takes one complex multiplier. For N_crosslets = 1 crosslet per integration interval,
using N/2 + 1 = 9 X_pn utilizes only 144 / 1024 ~= 14 % of the processing resources. However this
is acceptable because:
- the FPGA has sufficient multipliers
- it provides a clear design
- the spare capacity can be used to process more crosslets per integration interval
Design decision:
Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or up to 7
crosslets in parallel, per integration interval.
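
A short check of the correlator cell capacity and the resulting lane utilization for 7 crosslets
(Python used purely as a calculator):

  # Crosslet frequencies per correlator cell and lane utilization for 7 crosslets.
  N_fft  = 1024
  X_sq   = 144
  f_sub  = 195312.5
  hops   = 8                     # N/2 for N = 16
  L_lane = 7.8125e9

  per_cell = N_fft // X_sq                          # 7 crosslet frequencies per cell
  load_7   = (60 + 7 * 48) * 8 * f_sub * hops       # lane load for 7 crosslets
  print(per_cell, round(load_7 / 1e9, 2), round(load_7 / L_lane, 2))   # prints: 7 4.95 0.63
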
Send more than one time slot per packet?
To reduce the relative packet overhead for single crosslet XC it is an option to put multiple
time slots per payload. Design decision: This is considered too complicated.
What if a packet gets lost?
The local crosslets cannot get lost, but remote packets may get lost. The BSN aligner will replace
lost remote packets with filler packets that are flagged. The crosslets in the filler packets
contain zero data, so in the correlator they do not contribute to the visibilities. Each correlator
cell has to count the number of valid and flagged crosslets per integration interval. The number
of valid crosslets N_valid can be used to weight the visibility relative to the expected number of
N_int crosslets. The number of flagged crosslets N_flagged is used for monitoring. For every
integration interval N_valid + N_flagged = N_int, by design of the BSN aligner.
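
One possible way to apply N_valid when interpreting an integrated visibility is sketched below.
This is only an illustration of the weighting idea, with a made-up function name and example
values; it is not a statement about where (firmware or M&C software) such weighting is implemented.

  # Sketch: rescale an integrated visibility to compensate for flagged (zeroed)
  # filler crosslets, using N_valid out of the expected N_int crosslets.
  def weighted_visibility(vis_sum, n_valid, n_int):
      if n_valid == 0:
          return None                      # whole interval was flagged, no estimate
      return vis_sum * n_int / n_valid     # scale up for the missing contributions

  # Example: 10 of an expected n_int = 195312 crosslets were flagged.
  print(weighted_visibility(vis_sum=1.0e6, n_valid=195302, n_int=195312))
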
Time in diagrams:
- equal time for all PN in same row and in same relative column
- not calculated because conj()

What if a node fails?
The next N/2 nodes will then miss packets. The order of the packets is not affected, because on
each node it will be the last one or more packets that are missed. There will be no correlations
for the missed packets, but the correlation should continue when the node starts again in a next
time slot. A packet count per packet source at each node will reveal missed packets and thus also
the number of integrations that went into the final visibilities. If no packets are missed then
the packet count is 195312.5 per integration interval on every PN for every packet source PN.

    PN0   PN1   PN2   PN3   PN4
t
0:  L0    .     L2    L3    L4    <-- PN1 fails, so next N/2 nodes will miss packets
    R4    .     .     R2    R3
    R3    .     .     .     R2
    00    .     22    33    44
    04    .     .     32    43
    03    .     .     .     42

Packet order is guaranteed?
At the start of every time slot the local L# packet is sent first. After that each node passes on
the packets that it receives. Therefore the packets arrive in order, with the packet from the
closest node first and from the furthest node last. If a packet gets lost then there will be a
gap, but the order is still preserved.

What if T_sq > T_hop latency on ring?
What if T_sub > N/2 * T_hop latency on ring?
*******************************************************************************
Open issues:
- Central HDL_IO_FILE_SIM_DIR = build/sim --> Project local sim dir
- avs_eth_coe.vhd per tool version? Because copying avs_eth_coe_<buildset>_hw.tcl to $HDL_BUILD_DIR
  copies the last <buildset>, using more than one buildset at a time gives conflicts.
*******************************************************************************
* To do:
*******************************************************************************
- Check that the Expert users (MB, SJW, MN), Maintainers (HM) and Local users are happy with the
  design decisions
- H6 M&C loads section
- H3 Functions mapping
- H3/4 Timing (1s default, PPS, event message)
- Update RadioHDL docs
- Write RadioHDL article
- Write HDL RL=0 article - desp_hdl_design_article.txt
- XST : SNR = 1 per visibility for 10000 samples, brightest source log 19.5 --> 4.5 dB --> T_int = 1 s is ok.