diff --git a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
index 6be04ac2948f4561e5eacb97a9c4887a65b8fde7..762ef9fe48ad6cf19c9d0c8726034f13c91eb6df 100755
--- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
@@ -291,133 +291,6 @@ The beamformer function has the following sub functions:
 * Subband Correlator
 *******************************************************************************
-With transport scheme 1 crosslets from different source nodes are combined into one packet.
-Scheme 3 packs only local crosslets into a packet. Compared to scheme 1, scheme 3:
-- treats the local crosslets and remote crosslets independently
-- has small payload and thus more packet overhead, but the packet load still fits on a lane
-- has small payload that can be enlarged by transporting more local crosslets, to support
- a subband correlator with N_crosslets > 1 per integration interval.
-
-Design decision:
- Use transport scheme 3 with N/2 hops where every node sends its local crosslets N/2 hops,
- because it is more flexible to have only local crosslets per packet.
-
-
-Number of square correlator cells per PN:
-There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2
-remote crosslet packets. The local crosslets have to be correlated with the local crosslets and
-with each of the S_lba - S_pn remote crosslet packets. The correlation with the local crosslets
-is a square matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square
-correlator cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other
-local-remote square correlator cells the efficiency is 100%. With N = 16 PN for LBA there are
-N/2 = 8 remote crosslet packets. Hence together with the local crosslet visibilities this yields
-X_pn = (floor(N/2) + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. In total the subband
-correlator calculates N * X_pn = 16 * 1296 = 20736 visibilities. There are
-S_lba * (S_lba + 1)/2 = 192 * 193 / 2 = 18528 unique visibilities. The difference 20736 - 18528
-- 2208 is due to that:
-
-. for any N the N * S_pn*(S_pn-1)/2 = 16 * 12*11/2 = 1056 local-local visibilities are calculated
- twice
-. for N is even floor(N/2) * S_pn*S_pn = 16/2 * 12*12 = 1152 local-remote visibilities are
- calculated twice. For N is odd the local-remote visibilities are only calculated once.
-
-and to check 1056 + 1152 = 2208 indeed.
-
-
-
-Number of multipliers per crosslet:
-The subband correlator needs to finished within one subband period, so within N_fft = 1024 clock
-cycles. The X_pn = 1296 visibililies per PN can be caluculated using one complex multiplier if
-the complex multiplier runs at 1296 / 1024 * 200 M > 253 MHz. For an oversampled filterbank with
-R_os <= 1.28 this requires 324 MHz, which is too much. All X_pn = 1296 can be calculated using
-two complex multipliers running at > 161 MHz. However another option is to use one pultiplier
-per X_sq = 144 visibilities, so one complex multiplier per correlator cell and N/2 + 1 = 9
-correlator cells in parallel. The FPGA has sufficient multipliers to support this scheme and the
-spare capacity of each correlator cell can be used to support a subband correlator with more
-than 1 subband per integration interval, so N_crosslets > 1.
-
-Design decision:
- Use 1 + N/2 parallel correlator cells, for the local-local visibilities and for the local-
- remote visibilities for each remote source.
-
-
-What is the crosslet packet size?
-With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per
-packet. A crosslet is a W_crosslet = 16 bit complex value, so P_payload = 12 * 4 = 48 octets
-payload, so the effective packet size is P_packet = P_overhead + P_payload = 60 + 48 = 108 octets.
-The relative packet overhead for single crosslet payloads is P_overhead / P_packet = 60 / 108 =
-55%. Note that P_overhead_dp + P_payload = 20 + 48 = 68 octets still meets the minimum Ethernet
-payload size requirement of 46 octets.
-
-Maximum number of crosslets per lane:
-There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. This yields
-a packet load of P_packet * f_sub * N/2 = (108 * 8b) * 195312.5 * 16 / 2 = 1.35 Gbps. The data
-load of only the payload data is P_payload * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 =
-0.6 Gbps. Hence the small packet size causes a large packet overhead, but is still acceptable,
-since it is < L_lane = 7.8125 Gbps, so it fits on a single 10G lane of the ring.
-Multiple local crosslets could be transported via seperate packets, a lane can then fit about
-7.8125 / 1.35 ~= 5 different crosslets. Packing the local crosslets into a single payload
-reduces the packet overhead. The maximum number of crosslets per packet follows from
-(P_overhead + x * P_payload * 8b) * f_sub * N/2 < L_lane. For N = 16 this yields x ~=
-(7.8125 Gbps / (16/2) / 195312.5 - 60) / (48 * 8b) = 12. With x = 12 crosslets the payload size
-is 12 * 48 = 576 and the effective packet size is P_packet = 60 + 576 = 636 octets. The relative
-packet overhead for multi crosslet payloads is P_overhead / P_packet = 60 / 636 ~= 9.4%. The
-packet load for multi crosslet payloads is (636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps >
-L_lane = 7.8125 Gbps, so this just does not fit on a 10GbE lane, due to the still significant
-packet overhead. Using x = 11 instead of x = 12 crosslets per packet yields a total crosslet
-packet load per lane of ((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on
-a lane.
-
-Design decision:
- Pack local crosslets into a single payload if N_crosslets > 1, because then the relative packet
- overhead is much reduced to support transporting more crosslets per lane (11 instead of 5).
-
-
-Maximum number of crosslets per correlator cell:
-An X_pn correlator cell can correlate N_clk / X_sq = 1024 / 144 = 7 different crosslets frequencies.
-With N = 16 for LBA, there need to be P_xc = N/2 + 1 = 9 of these X_pn correlator cells in parallel. One
-X_pn correlates the local-local crosslets and the other N/2 = 8 X_pn correlate the local-remote
-crosslets. These 9 X_pn in parallel can correlate up to 7 different crosslets. The link can
-transport maximum 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO
-capacity of one 10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets.
-The crosslet data rate on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a
-utilization of 4.95 / 7.8125 = 63%. Another set of 9 X_pn could be used to correlate the remaining
-11 - 7 = 4 crosslets that can be transported via that lane. However, if more than N_crosslet = 7
-crosslets need to be correlated in parallel per integration interval, then it is easier to allocate
-an extra lane and to instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in
-total.
-
-One X_pn takes one complex multiplier. For N_crosslets = 1 crosslet per integration interval using
-1 + N/2 = 9 X_pn uses only 144 / 1024 = 14% of the processing resources. However this is acceptable
-because:
-- the FPGA has sufficient multipliers
-- it provides a clear design
-- the spare capacity can be used to process more crosslets per integration interval
-
-Design decision:
- Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or upto 7
- crosslets in parallel, per integration interval.
-
-
-Send more than one time slot per packet?
-To reduce the relative packet overhead for single crosslet XC it is an option to put multiple
-time slots per payload. Design decision: This is considered to complicated.
-
-
-What if a packet gets lost?
-The local crosslets cannot get lost, but remote packets may get lost. For transit crosslet packets
-a lost packet remains lost, because it cannot be replaced. For the subband correlator at this
-node the lost remote packets can be replaced by filler data, because the BSN aligner can use the
-local input as reference to detect lost packets. The BSN aligner will replace lost remote packets
-with filler packets that are flagged. The crosslets in the filler packets contain zero data, so in
-the correlator they do not contribute to the visibilities. Each X_pn correlator cell operates on
-crosslets from another source. Therefore each X_pn correlator cell has to maintain a count of the
-number of valid N_valid and of the number of flagged N_flagged crosslets per integration interval.
-The N_valid can be used to weight the visibility relative to the expected number of N_int
-crosslets. The N_flagged is used for monitoring. For every integration interval N_int = N_valid
-+ N_flagged should be true, by design of the BSN aligner.
-
-
 What if T_sq > T_hop latency on ring?
 What if T_sub > N/2 * T_hop latency on ring?
@@ -486,53 +359,7 @@ If the BSN aligners allows direct memory access to its input buffers then the X_
 correlator cell can read the crosslets from the BSN aligner in arbitrary order and multiple times.
-X_sq correlator cell:
-The X_sq correlator cell has two input streams. One input stream delivers the crosslet from
-S_pn = 12 signal inputs on one PN and the other input stream delivers the crosslet from
-S_pn = 12 signal inputs on the same PN (for local-local visibilities) or another PN (for the
-local-remote visibilities). In total the X_sq calculates X_sq = S_pn * S_pn = 12*12 = 144
-visibilities. The crosslets are delivered sequentially using a double for loop, so for each
-crosslet i in range(S_pn) on one input and for each crosslet j in range(S_pn) on the other
-input calculate the product and intergrate the visibility. This calculation sequence requires
-that crosslets can be addressed multiple times. For N_crosslets = 1 the X_sq correlator cell
-only correlates the first S_pn = 12 crosslets that are delivered on its two inputs. For
-N_crosslets > 1 the X_sq continues correlating the next S_pn = 12 crosslets that are delivered
-on its two inputs. Hence N_crosslets > 1 merely adds another for loop level to the X_sq, that
-loops for k in range(N_crosslets). The visibilities are calculated in order:
- k, i, j
- 0, 0, 0
- 0, 0, 1
- . . .
- 0, 0,11
- 0, 1, 0
- 0, 1, 1
- . . .
- 0, 1,11
- . . .
- . . .
- 0,11, 0
- 0,11, 1
- . . .
- 0,11,11
- 1, 0, 0
- etc.
-
-Support for other (shorter) integration period T_int_x?
-- Longer T_int as multiple of 1 s can be supported outside SDP
-- Longer T_int can be supported within SDP by:
- . Using BSN scheduler
- . Reduces M&C data rate
- . Should still fit in number of bit of visibility
-- Shorter T_int < 1 s (PPS):
- . Using BSN scheduler
- . increases M&C data rate
- . should still fit within PPS grid
-- Publish T_int_x period ended event message to Station Control
-
-How can it be scaled to more than one crosslet per XST?
- - multiple per packet
- - multiple instances of one
 *******************************************************************************
diff --git a/libraries/base/dp/src/vhdl/dp_fifo_info.vhd b/libraries/base/dp/src/vhdl/dp_fifo_info.vhd
index bd4b0a41450cb531e2256405e693ab132896589b..173238f1843d7fb0a76612a8102d05d94932b6f9 100644
--- a/libraries/base/dp/src/vhdl/dp_fifo_info.vhd
+++ b/libraries/base/dp/src/vhdl/dp_fifo_info.vhd
@@ -63,6 +63,12 @@
 -- dp_block_gen. This assumes that the DSP does pass on the valid, that the
 -- block size is known and that the first valid at the output corresponds
 -- to a sop.
+-- . These are related components that try to pass on sosi info from begin to
+-- end, without having to pass it on through each step in the sosi data
+-- processing.
+-- - dp_paged_sop_eop_reg
+-- - dp_fifo_info.vhd
+-- - dp_block_gen_valid_arr
 LIBRARY IEEE, common_lib, technology_lib;
 USE IEEE.STD_LOGIC_1164.ALL;
diff --git a/libraries/base/dp/src/vhdl/dp_paged_sop_eop_reg.vhd b/libraries/base/dp/src/vhdl/dp_paged_sop_eop_reg.vhd
index ac38a487f76479340306b4cd62489fe0b8046a54..45c527aa2d13eb717d374b8e83fe39e698f3c481 100644
--- a/libraries/base/dp/src/vhdl/dp_paged_sop_eop_reg.vhd
+++ b/libraries/base/dp/src/vhdl/dp_paged_sop_eop_reg.vhd
@@ -32,6 +32,12 @@
 -- eop_wr_en <= snk_in.eop & snk_in.eop;
 -- to capture the input at the first wr_en and hold it for output at the
 -- next wr_en.
+-- . These are related components that try to pass on sosi info from begin to
+-- end, without having to pass it on through each step in the sosi data
+-- processing.
+-- - dp_paged_sop_eop_reg
+-- - dp_fifo_info.vhd
+-- - dp_block_gen_valid_arr
 LIBRARY IEEE, common_lib;
 USE IEEE.STD_LOGIC_1164.ALL;