Merge branch 'master' into L2SDP-37

Commit c162afef authored 4 years ago by Kenneth Hiemstra
--- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
@@ -291,133 +291,6 @@ The beamformer function has the following sub functions:
 * Subband Correlator
 *******************************************************************************

-With transport scheme 1 crosslets from different source nodes are combined into one packet.
-Scheme 3 packs only local crosslets into a packet. Compared to scheme 1, scheme 3:
- treats the local crosslets and remote crosslets independently
- has small payload and thus more packet overhead, but the packet load still fits on a lane
- has small payload that can be enlarged by transporting more local crosslets, to support
-  a subband correlator with N_crosslets > 1 per integration interval.
-
-Design decision:
-  Use transport scheme 3 with N/2 hops where every node sends its local crosslets N/2 hops,
-  because it is more flexible to have only local crosslets per packet. 
-
-
-Number of square correlator cells per PN:
-There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2
-remote crosslet packets. The local crosslets have to be correlated with the local crosslets and
-with each of the S_lba - S_pn remote crosslet packets. The correlation with the local crosslets
-is a square matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square
-correlator cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other
-local-remote square correlator cells the efficiency is 100%. With N = 16 PN for LBA there are
-N/2 = 8 remote crosslet packets. Hence together with the local crosslet visibilities this yields
-X_pn = (floor(N/2) + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. In total the subband
-correlator calculates N * X_pn = 16 * 1296 = 20736 visibilities. There are 
-S_lba * (S_lba + 1)/2 = 192 * 193 / 2 = 18528 unique visibilities. The difference 20736 - 18528
- 2208 is due to that:
-
-. for any N the N * S_pn*(S_pn-1)/2 = 16 * 12*11/2 = 1056 local-local visibilities are calculated
-  twice
-. for N is even floor(N/2) * S_pn*S_pn = 16/2 * 12*12 = 1152 local-remote visibilities are
-  calculated twice. For N is odd the local-remote visibilities are only calculated once.
-  
-and to check 1056 + 1152 = 2208 indeed.
-
-
-
-Number of multipliers per crosslet:
-The subband correlator needs to finished within one subband period, so within N_fft = 1024 clock
-cycles. The X_pn = 1296 visibililies per PN can be caluculated using one complex multiplier if
-the complex multiplier runs at 1296 / 1024 * 200 M > 253 MHz. For an oversampled filterbank with
-R_os <= 1.28 this requires 324 MHz, which is too much. All X_pn = 1296 can be calculated using
-two complex multipliers running at > 161 MHz. However another option is to use one pultiplier 
-per X_sq = 144 visibilities, so one complex multiplier per correlator cell and N/2 + 1 = 9 
-correlator cells in parallel. The FPGA has sufficient multipliers to support this scheme and the
-spare capacity of each correlator cell can be used to support a subband correlator with more 
-than 1 subband per integration interval, so N_crosslets > 1.
-
-Design decision:
-  Use 1 + N/2 parallel correlator cells, for the local-local visibilities and for the local-
-  remote visibilities for each remote source.
-
-
-What is the crosslet packet size?
-With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per
-packet. A crosslet is a W_crosslet = 16 bit complex value, so P_payload = 12 * 4 = 48 octets
-payload, so the effective packet size is P_packet = P_overhead + P_payload = 60 + 48 = 108 octets.
-The relative packet overhead for single crosslet payloads is P_overhead / P_packet = 60 / 108 = 
-55%. Note that P_overhead_dp + P_payload = 20 + 48 = 68 octets still meets the minimum Ethernet 
-payload size requirement of 46 octets.
-
-Maximum number of crosslets per lane:
-There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields
-a packet load of P_packet * f_sub * N/2 = (108 * 8b) * 195312.5 * 16 / 2 = 1.35 Gbps. The data
-load of only the payload data is P_payload * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 =
-0.6 Gbps. Hence the small packet size causes a large packet overhead, but is still acceptable,
-since it is < L_lane = 7.8125 Gbps, so it fits on a single 10G lane of the ring.
-Multiple local crosslets could be transported via seperate packets, a lane can then fit about 
-7.8125 / 1.35 ~= 5 different crosslets. Packing the local crosslets into a single payload 
-reduces the packet overhead. The maximum number of crosslets per packet follows from 
-(P_overhead + x * P_payload * 8b) * f_sub * N/2 < L_lane. For N = 16 this yields x ~=
-(7.8125 Gbps / (16/2) / 195312.5 - 60) / (48 * 8b) = 12. With x = 12 crosslets the payload size
-is 12 * 48 = 576 and the effective packet size is P_packet = 60 + 576 = 636 octets. The relative
-packet overhead for multi crosslet payloads is P_overhead / P_packet = 60 / 636 ~= 9.4%. The
-packet load for multi crosslet payloads is (636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps > 
-L_lane = 7.8125 Gbps, so this just does not fit on a 10GbE lane, due to the still significant
-packet overhead. Using x = 11 instead of x = 12 crosslets per packet yields a total crosslet
-packet load per lane of ((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on
-a lane.
-
-Design decision: 
-  Pack local crosslets into a single payload if N_crosslets > 1, because then the relative packet
-  overhead is much reduced to support transporting more crosslets per lane (11 instead of 5).
-
-  
-Maximum number of crosslets per correlator cell:
-An X_pn correlator cell can correlate N_clk / X_sq = 1024 / 144 = 7 different crosslets frequencies.
-With N = 16 for LBA, there need to be P_xc = N/2 + 1 = 9 of these X_pn correlator cells in parallel. One
-X_pn correlates the local-local crosslets and the other N/2 = 8 X_pn correlate the local-remote
-crosslets. These 9 X_pn in parallel can correlate up to 7 different crosslets. The link can
-transport maximum 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO 
-capacity of one 10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets.
-The crosslet data rate on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a
-utilization of 4.95 / 7.8125 = 63%. Another set of 9 X_pn could be used to correlate the remaining
-11 - 7 = 4 crosslets that can be transported via that lane. However, if more than N_crosslet = 7
-crosslets need to be correlated in parallel per integration interval, then it is easier to allocate
-an extra lane and to instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in
-total.
-
-One X_pn takes one complex multiplier. For N_crosslets = 1 crosslet per integration interval using
-1 + N/2 = 9 X_pn uses only 144 / 1024 = 14% of the processing resources. However this is acceptable 
-because:
- the FPGA has sufficient multipliers
- it provides a clear design
- the spare capacity can be used to process more crosslets per integration interval
-
-Design decision:
-   Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or upto 7
-   crosslets in parallel, per integration interval. 
-  
-
-Send more than one time slot per packet?
-To reduce the relative packet overhead for single crosslet XC it is an option to put multiple
-time slots per payload. Design decision: This is considered to complicated.
-
-
-What if a packet gets lost?
-The local crosslets cannot get lost, but remote packets may get lost. For transit crosslet packets
-a lost packet remains lost, because it cannot be replaced. For the subband correlator at this
-node the lost remote packets can be replaced by filler data, because the BSN aligner can use the
-local input as reference to detect lost packets. The BSN aligner will replace lost remote packets
-with filler packets that are flagged. The crosslets in the filler packets contain zero data, so in
-the correlator they do not contribute to the visibilities. Each X_pn correlator cell operates on
-crosslets from another source. Therefore each X_pn correlator cell has to maintain a count of the
-number of valid N_valid and of the number of flagged N_flagged crosslets per integration interval.
-The N_valid can be used to weight the visibility relative to the expected number of N_int
-crosslets. The N_flagged is used for monitoring. For every integration interval N_int = N_valid
-+ N_flagged should be true, by design of the BSN aligner.
-
-

 What if T_sq > T_hop latency on ring?
 What if T_sub > N/2 * T_hop latency on ring?
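
Note on the removed Subband Correlator text in the hunk above: its visibility counts and lane loads can be re-derived with a short Python sketch. This is only a check under stated assumptions, not part of the repository. The function names are hypothetical, and the constants (S_pn = 12, N = 16, f_sub = 195312.5, P_overhead = 60 octets, 48 octets per crosslet payload, L_lane = 7.8125 Gbps) are copied from the removed text.

```python
# Sketch only: re-derives numbers quoted in the removed text.
# All names are hypothetical; constants are taken from that text.

S_PN = 12           # signal inputs (local crosslets) per PN
N_PN = 16           # number of PN on the ring for LBA
F_SUB = 195312.5    # subbands per second
P_OVERHEAD = 60     # packet overhead in octets
P_PAYLOAD = 48      # octets for S_pn crosslets of one subband (12 * 4)
L_LANE = 7.8125e9   # usable lane rate in bit/s


def visibility_counts(n_pn=N_PN, s_pn=S_PN):
    """Return (X_sq, X_pn, total, unique, dup_local, dup_remote)."""
    x_sq = s_pn * s_pn                      # one square correlator cell
    x_pn = (n_pn // 2 + 1) * x_sq           # local-local + N/2 local-remote
    total = n_pn * x_pn                     # calculated over all PN
    s_all = n_pn * s_pn                     # S_lba for LBA
    unique = s_all * (s_all + 1) // 2
    dup_local = n_pn * s_pn * (s_pn - 1) // 2
    # Only for even N is one set of local-remote products calculated twice.
    dup_remote = (n_pn // 2) * x_sq if n_pn % 2 == 0 else 0
    return x_sq, x_pn, total, unique, dup_local, dup_remote


def lane_load_bps(x, n_pn=N_PN):
    """Packet load in bit/s for x crosslets per packet travelling N/2 hops."""
    packet_octets = P_OVERHEAD + x * P_PAYLOAD
    return packet_octets * 8 * F_SUB * n_pn / 2


def max_crosslets_per_packet(n_pn=N_PN):
    """Largest x for which the packet load stays below L_lane."""
    x = 0
    while lane_load_bps(x + 1, n_pn) < L_LANE:
        x += 1
    return x


if __name__ == "__main__":
    print(visibility_counts())          # counts per cell, per PN, total, unique, duplicates
    print(lane_load_bps(1) / 1e9)       # single crosslet per packet, in Gbps
    print(max_crosslets_per_packet())   # crosslets per packet that fit on one lane
```

The sketch keeps packet sizes in octets until the final conversion to bits, so the limit of 11 crosslets per packet follows directly from the lane rate, in line with the design decision quoted in the removed text.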
@@ -486,53 +359,7 @@ If the BSN aligners allows direct memory access to its input buffers then the X_
 correlator cell can read the crosslets from the BSN aligner in arbitrary order and multiple
 times.

-X_sq correlator cell:
-The X_sq correlator cell has two input streams. One input stream delivers the crosslet from 
-S_pn = 12 signal inputs on one PN and the other input stream delivers the crosslet from
-S_pn = 12 signal inputs on the same PN (for local-local visibilities) or another PN (for the
-local-remote visibilities). In total the X_sq calculates X_sq = S_pn * S_pn = 12*12 = 144 
-visibilities. The crosslets are delivered sequentially using a double for loop, so for each
-crosslet i in range(S_pn) on one input and for each crosslet j in range(S_pn) on the other
-input calculate the product and intergrate the visibility. This calculation sequence requires
-that crosslets can be addressed multiple times. For N_crosslets = 1 the X_sq correlator cell
-only correlates the first S_pn = 12 crosslets that are delivered on its two inputs. For
-N_crosslets > 1 the X_sq continues correlating the next S_pn = 12 crosslets that are delivered
-on its two inputs. Hence N_crosslets > 1 merely adds another for loop level to the X_sq, that
-loops for k in range(N_crosslets). The visibilities are calculated in order:
-  k, i, j
-  0, 0, 0
-  0, 0, 1
-  .  .  .
-  0, 0,11
-  0, 1, 0
-  0, 1, 1
-  .  .  .
-  0, 1,11
-  .  .  .
-  .  .  .
-  0,11, 0
-  0,11, 1
-  .  .  .
-  0,11,11
-  1, 0, 0
-  etc.
-  

-Support for other (shorter) integration period T_int_x?
- Longer T_int as multiple of 1 s can be supported outside SDP
- Longer T_int can be supported within SDP by:
-  . Using BSN scheduler
-  . Reduces M&C data rate
-  . Should still fit in number of bit of visibility 
- Shorter T_int < 1 s (PPS):
-  . Using BSN scheduler
-  . increases M&C data rate
-  . should still fit within PPS grid
- Publish T_int_x period ended event message to Station Control
-
-How can it be scaled to more than one crosslet per XST?
-  - multiple per packet
-  - multiple instances of one


 *******************************************************************************
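
Note on the removed "X_sq correlator cell" text in the hunk above: the k, i, j calculation order it lists can be sketched in Python as below. This is a minimal illustration under assumptions, not the VHDL implementation. The function names are hypothetical, the flat buffer layout (index k * S_pn + i) is assumed, and the use of the complex conjugate on the second input is an assumed convention, since the text only says "calculate the product".

```python
# Sketch only: loop order of one X_sq correlator cell as described in the
# removed text. Names, buffer layout and conjugate convention are assumptions.

S_PN = 12  # signal inputs per PN


def x_sq_order(n_crosslets=1, s_pn=S_PN):
    """Yield (k, i, j) in the order the visibilities are calculated."""
    for k in range(n_crosslets):        # extra loop level for N_crosslets > 1
        for i in range(s_pn):           # crosslet i on the first input
            for j in range(s_pn):       # crosslet j on the second input
                yield k, i, j


def integrate_block(vis, a, b, n_crosslets=1, s_pn=S_PN):
    """Accumulate one time slot of products into the visibilities dict.

    a and b are flat sequences of complex crosslets of length
    n_crosslets * s_pn. Each crosslet is read s_pn times, which is why the
    inputs must be addressable multiple times.
    """
    for k, i, j in x_sq_order(n_crosslets, s_pn):
        product = a[k * s_pn + i] * b[k * s_pn + j].conjugate()
        vis[(k, i, j)] = vis.get((k, i, j), 0) + product
    return vis


if __name__ == "__main__":
    order = list(x_sq_order(n_crosslets=2))
    print(len(order))                # 2 * 144 entries
    print(order[:2], order[144])     # start of k = 0 and first entry of k = 1
```

Calling integrate_block once per time slot with the same vis dict models the integration over an interval; a local-local cell passes the same buffer for a and b.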

--- a/libraries/base/dp/src/vhdl/dp_fifo_info.vhd
+++ b/libraries/base/dp/src/vhdl/dp_fifo_info.vhd
@@ -63,6 +63,12 @@
 --   dp_block_gen. This assumes that the DSP does pass on the valid, that the
 --   block size is known and that the first valid at the output corresponds
 --   to a sop.
+-- . These are related components that try to pass on sosi info from begin to
+--   end, without having to pass it on through each step in the sosi data
+--   processing.
+--   - dp_paged_sop_eop_reg
+--   - dp_fifo_info.vhd
+--   - dp_block_gen_valid_arr

 LIBRARY IEEE, common_lib, technology_lib;
 USE IEEE.STD_LOGIC_1164.ALL;

--- a/libraries/base/dp/src/vhdl/dp_paged_sop_eop_reg.vhd
+++ b/libraries/base/dp/src/vhdl/dp_paged_sop_eop_reg.vhd
@@ -32,6 +32,12 @@
 --     eop_wr_en <= snk_in.eop & snk_in.eop;
 --   to capture the input at the first wr_en and hold it for output at the
 --   next wr_en.
+-- . These are related components that try to pass on sosi info from begin to
+--   end, without having to pass it on through each step in the sosi data
+--   processing.
+--   - dp_paged_sop_eop_reg
+--   - dp_fifo_info.vhd
+--   - dp_block_gen_valid_arr

 LIBRARY IEEE, common_lib;
 USE IEEE.STD_LOGIC_1164.ALL;