From 0287cfbbfc50898635bc40ad994481992f28a3b2 Mon Sep 17 00:00:00 2001 From: Eric Kooistra <kooistra@astron.nl> Date: Fri, 29 Nov 2019 17:00:59 +0100 Subject: [PATCH] Worked on ring and workpackage planning, minor edits in other txt files. --- .../lofar2/doc/prestudy/station2_sdp_dsp.txt | 13 +- .../prestudy/station2_sdp_firmware_design.txt | 2 +- .../station2_sdp_firmware_planning.txt | 160 ++++- .../prestudy/station2_sdp_hdl_components.txt | 221 ++++--- .../lofar2/doc/prestudy/station2_sdp_ring.txt | 566 +++++++++++------- .../doc/prestudy/station2_sdp_timing.txt | 99 +-- 6 files changed, 701 insertions(+), 360 deletions(-) diff --git a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt index 31bf474f7e..8b65973701 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt @@ -49,9 +49,16 @@ M&C: * Subband correlator ******************************************************************************* -First the local crosslets are correlated with themselves and then -the local crosslets are kept in a barrel shifter, such that they can also be correlated with the -remote crosslets that arrive in the packets. +- Subband select of N_crosslets local crosslets per signal input +- Interleave local crosslets from S_pn = 12 signal inputs +- XC ring +- XC dispatcher of local and remote crosslets +- X_sq correlator cell with N_crosslets * S_pn*S_pn visibilities and N_valid, N_flagged counts +- M&C: + . Subband select + . XC ring + . 
X_sq + ******************************************************************************* diff --git a/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt b/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt index f3c52363fd..600f497c64 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt @@ -23,7 +23,7 @@ Definitions Introduction - Context . ADD fig 3.1-1 (E)ICD and L3 PBS overview -- Scope +- Scope and purpose - Document overview Station overview diff --git a/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt b/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt index 873b3c2a16..19cb2fb62f 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt @@ -3,24 +3,24 @@ ******************************************************************************* Includes design, implementation, verification on HW, technical commissioning. -v1 v2 +v1 v2 Infrastructure 10 20 - Development environment using GIT, RadioHDL, updating existing components 20 . - BSP using Gemini Protocol, ARGS 10 . - Ethernet access (OSI 1-4) 10 20 - Ring access - - Applications: + + Application: 15 . - ADC ingress and time stamp 20 10 - Subband filterbank (critically sampled) 0 30 - Subband filterbank (oversampled) 10 . - Beamformer 20 . - Subband correlator -25 . - Transient buffer (DDR4 interface, subband select and DM >= 0, packet format, M&C, RW access via M&C) +25 . - Transient buffer (DDR4 interface, subband select and DM >= 0, packet format, M&C, RW access via M&C) 20 . - Transient detection 20 . - Subband offload 0 . - 160 MHz - + 35 . Integration 5 - FPGA pinning 10 - Interface test designs unb2c @@ -41,7 +41,7 @@ v2 : 10 less for critically sampled PFB ==> EK, JH: v1 estimate of April 2019 is still valid as v2 on 10 Oct 2019. 
-v3 : +v3 : Infrastructure 20 - Development environment using GIT, RadioHDL, updating existing components @@ -51,7 +51,7 @@ v3 : 20 - Ring access 10 - 10GbE access (OSI 1-4) - Applications: + Application: 15 - ADC input and time stamp 10 - Subband filterbank (critically sampled) 20 - Subband correlator @@ -61,7 +61,7 @@ v3 : 20 - Transient detection 30 - Oversampled subband filterbank 0 - Support 160 MHz - + Integration: 10 - Lab tests 5 - Technical commissioning Dwingeloo @@ -73,3 +73,147 @@ All: No oversampled filterbank: 20 + 5 + 10 + 20 + 20 + 10 + 15 + 10 + 20 + 10 + 25 + 20 + 20 + 0 + 10 + 5 + 5 = 225 + + + +******************************************************************************* +* SDP Workpackage (UniBoard2 HW + FW) +******************************************************************************* + +Firmware FPGA images: +- the SDP has one main firmware design unb2c_sdp, +- the integrated design of SDP is revision unb2c_sdp_station, +- per task there are revisions of unb2c_sdp that contain subsets of the SDP functionality, + + +Deliverables (D): items that are needed for a milestone +Milestones (M) : 'cake moments' when you demonstrate deliverables +- integration passed +- review passed + + +Tasks: + +INFRASTRUCTURE UniBoard2: + weeks nr task + 20 1) Maintain firmware development environment + - using GIT + - using RadioHDL + - updating existing VHDL library components + D=> Operational firmware development environment + D=> VHDL libraries verified in simulation + + 2) UniBoard2 board and test firmware + - unb2c board HW + D=> unb2c board detailed design document + D=> unb2c board schematic + D=> unb2c board layout + + M=> unb2c board detailed design document review (unb2b modifications) + M=> unb2c board schematic review + M=> unb2c board layout review (production ready) + M=> unb2c board lab validation using JTAG, unb2c_test designs OK + M=> unb2c board production validation using JTAG, unb2c_minimal_gmi OK + + 5 - unb2c FPGA pinning design + 10 - 
unb2c FPGA interface test designs + D=> unb2c_test design revisions (1GbE, 10GbE, DDR4, flash, ADC) + D=> unb2c_test_adc (read ADC samples from multiple inputs) + + + 20 3) UniBoard2 board support package (BSP) + - M&C by SCU via Gemini protocol + - M&C interface definition and generation using ARGS (doc, C, HDL) + D=> Gemini board for SCU M&C tests + D=> unb2c_minimal_gmi (1GbE, flash) + M=> unb2c_minimal_gmi validated using M&C by SCU (read design name) + +INFRASTRUCTURE SDP: + 10 4) Network access via 10GbE + - Ethernet MAC, UDP/IPv4, ARP, ping + D=> 10GbE HDL component including support for UDP/IPv4, ARP, ping + D=> unb2c_10GbE + M=> unb2c_10GbE validated using data capture on PC and ping + + 20 5) Ring access using test data and BSN monitor + D=> unb2c_ring_combiner for BF + D=> unb2c_ring_multicast for XC + D=> unb2c_ring_endcast for SO, TB + M=> unb2c_ring revisions verified in simulation + M=> unb2c_ring revisions validated on hardware using M&C on SCU + +APPLICATION SDP documents: + 6) Required documents + D=> Detailed design document of SDP firmware + D=> L1 ICD-11109 SDP-CEP: beamlet data protocol + D=> L1 ICD-11109 SDP-CEP: transient data protocol + D=> L2 ICD-11211 SC-SDP: FW register map and register definitions + D=> L2 ICD-11211 SC-SDP: UniBoard2 hardware M&C + D=> L2 ICD-11207 RCU2S-SDP: ADC interface + D=> L2 ICD-11209 STF-SDP: Time and frequency interface + D=> L2 ICD-11218 SDP-STCA: Subrack interface + + M=> SDP detailed design and interface documents ready for DDR + M=> SDP detailed design and interface documents updated for CDR + + D=> SDP firmware verification and maintenance document + M=> SDP all documents finished + +APPLICATION single node: + weeks nr task + 15 7) ADC input and timestamp (RCU2 interface) + ==> unb2c_sdp_adc_capture, read ADC or WG samples from databuffer via M&C + ==> unb2c_sdp_station (ADC) + + +M=> SDP ready for CDR + All major technical UniBoard2 hardware and SDP firmware risks are mitigated (by design and + based on 
validation with at least two UniBoard2 using JTAG, unb2c_minimal_gmi, unb2c_ring, + and unb2c_sdp_adc_capture). + + + 10 8) Subband filterbank (Fsub) + ==> unb2c_sdp_filterbank to read SST via M&C + ==> unb2c_sdp_station (ADC + SST) + +APPLICATION multi node: + weeks nr task + 20 9) Subband correlator (XC) + ==> unb2c_sdp_correlator_one_node, read XST via M&C and create ACM for one node + ==> unb2c_sdp_correlator_multi_node, read XST via M&C and use ring to create complete ACM + ==> unb2c_sdp_station (ADC + SST + XST) + +APPLICATION multi node / network output: + weeks nr task + 10 10) Beamformer (BF) + ==> unb2c_sdp_beamformer_bst_one_node, read BST via M&C + ==> unb2c_sdp_beamformer_output_one_input, output to CEP for one input from one node + ==> unb2c_sdp_beamformer_output_one_node, output to CEP and sum one node + ==> unb2c_sdp_beamformer_output_multi_node, output to CEP and use ring to sum nodes + ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output) + ==> detailed design doc + + 25 11) Transient buffer (TB) + ==> unb2c_sdp_transient_buffer revisions (ADC + SST + TB readout, M&C access DDR4) + ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout) + ==> detailed design doc + + 20 12) Transient detection (TD) + ==> unb2c_sdp_transient_buffer revisions (ADC + TD event) + ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout + TD event) + ==> detailed design doc + + 20 13) Subband offload (SO) for AARTFAAC2.0 + ==> unb2c_sdp_subband_offload revisions (ADC + SST + SO, one node, all nodes via ring) + ==> unb2c_sdp_station (ADC + SST + XST + BST + BF output + TB readout + TD event + SO) + ==> detailed design doc + +INTEGRATION: + weeks nr task + 20 14) Station integration tests (using unb2c_sdp_station) + - Laboratory tests + - Technical commissioning Dwingeloo Test Station ("Huisje West") + - Technical commissioning Prototype Test Station + - Technical commissioning Pre-production Test Station + + diff --git 
a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt index cc41dda277..20e64e6cfc 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt @@ -7,29 +7,31 @@ . rx_cnt: 18 bits, number Rx frames . brc : 1 bit, 0 if no Rx frames with CRC error, 1 if >= 1 Rx frames had a CRC error . sync : 1 bit, 1 if the frame with Rx sync was detected, else 0 - . align : 1 bit, 1 if all frames aligned OK, else 0 + . align : 1 bit, 1 if all frames aligned correctly, else 0 - RSP rad_latency: - . rx_latency : 16 bit, stores an internal count value when the Rx sync is detected. The internal count - restarts at the PPS sync. This measures the latency in clock cycles. + . rx_latency : 16 bit, stores an internal count value when the Rx sync is detected. The + internal count restarts at the PPS sync. This measures the latency in clock + cycles. - APERTIF dp_bsn_monitor - . mon_sync_timeout = '1' when the Rx sync did not occur within 200M cycles since last Rx sync ~= sync + . mon_sync_timeout = '1' when the Rx sync did not occur within 200M cycles since last + Rx sync ~= sync . mon_ready_stable = '1' when ready was always '1' during last Rx sync interval . mon_xon_stable = '1' when xon was always '1' during last Rx sync interval . mon_bsn_at_sync = BSN at Rx sync - . mon_nof_sop = number of sop during last Rx sync interval = rx_cnt - . mon_nof_err = number of err at eop during last Rx sync interval ~= brc + . mon_nof_sop = number of sop during last Rx sync interval = rx_cnt + . mon_nof_err = number of err at eop during last Rx sync interval ~= brc . mon_nof_valid = number of valid during last Rx sync interval . mon_bsn_first = BSN at first Rx sync --> not useful . mon_bsn_first_cycle_cnt = latency at first Rx sync --> should use every Rx sync like on RSP ==> Reuse dp_bsn_monitor with improvements: - . 
Monitor the packets per sync interval using Rx sync. This is more precise then using the PPS sync. - The Rx sync based values are only valid if mon_sync_timeout = 0. + . Monitor the packets per sync interval using Rx sync. This is more precise then using the PPS + sync. The Rx sync based values are only valid if mon_sync_timeout = 0. . Remove mon_bsn_first and mon_bsn_first_cycle_cnt. - . Add mon_latency, use PPS sync like in RSP to measure the latency between PPS sync and Rx sync in - number of clock cycles. + . Add mon_latency, use PPS sync like in RSP to measure the latency between PPS sync and Rx + sync in number of clock cycles. @@ -37,7 +39,8 @@ * DP encoder / decoder ******************************************************************************* - dp_packet_enc / dp_packet_dec - . Current dp_packet_enc encodes sosi fields into: CHAN (32b), sync & BSN (64b), DATA (>= 1 b), ERR (32b). + . Current dp_packet_enc encodes sosi fields into: CHAN (32b), sync & BSN (64b), DATA (>= 1 b), + ERR (32b). . Use new dp_packet_enc with CRC to mitigate false positive ETH CRC --> dp_packet_enc_crc: CHAN (32b), Sync & BSN (64b), DATA (>= 1 b), ERR (32b), CRC (32b) @@ -47,44 +50,55 @@ . After Rx frame the FSI is stripped and the CRC is replace by a boolean check (BRC). - CRC Error checking: - The CRC is a 32 bit number, so the chance that the CRC results in a false positive is 1/2**32 ~= 2.3e-10 or 1 - in 4.3e9. The packet rate is f_sub. Per T_sub interval the ring carries about 10 - 20 packets between N = 16 - nodes. Hence in total the packet rate of the ring for one LBA station is about 195312.5 * 20 * 16 ~= 60 M - packets / s. With 50 stations and LBA and HBA this become about a factor 100 more, so about 6 G packets / s. - If 0.01 % of the packets have errors, then the packet error rate is 0.6 M /s, so then once every 1 / (0.6e6 * 2.3e-10) - ~= 2 hours somewhere in LOFAR there will occur a false positive CRC. 
If such an error occurs, then it must not - cause the entire processing to stall. Therefore some additional check is necessary using a CRC. It is not - sufficient to check e.g. the ETH type and the expected packet length, because these do not cover the other - data in the packet. - Each station has about 100 10GbE links and there are 50 stations. Suppose the BER per link is 1e-10, so one bit - error per second per link, and that each bit error causes a CRC error. The total ring CRC error rate in LOFAR is then - 5000/s, so a false positive CRC will occur about once per 2**32/5000 = 10 days. This is not often, but if it - causes a station to fail then it is too often. - Having false positive CRCs even on a daily or weekly basis is too often. Therefore the application payload - should also have a CRC to ensure that no false positive CRC will occur during the life time of LOFAR 2.0. + The CRC is a 32 bit number, so the chance that the CRC results in a false positive is 1/2**32 + ~= 2.3e-10 or 1 in 4.3e9. The packet rate is f_sub. Per T_sub interval the ring carries about + 10 - 20 packets between N = 16 nodes. Hence in total the packet rate of the ring for one LBA + station is about 195312.5 * 20 * 16 ~= 60 M packets / s. With 50 stations and LBA and HBA + this become about a factor 100 more, so about 6 G packets / s. If 0.01 % of the packets have + errors, then the packet error rate is 0.6 M /s, so then once every 1 / (0.6e6 * 2.3e-10) ~= 2 + hours somewhere in LOFAR there will occur a false positive CRC. If such an error occurs, then it + must not cause the entire processing to stall. Therefore some additional check is necessary + using a CRC. It is not sufficient to check e.g. the ETH type and the expected packet length, + because these do not cover the other data in the packet. + Each station has about 100 10GbE links and there are 50 stations. Suppose the BER per link is + 1e-10, so one bit error per second per link, and that each bit error causes a CRC error. 
The +   total ring CRC error rate in LOFAR is then 5000/s, so a false positive CRC will occur about +   once per 2**32/5000 = 10 days. This is not often, but if it causes a station to fail then it +   is too often. Having false positive CRCs even on a daily or weekly basis is too often. +   Therefore the application payload should also have a CRC to ensure that no false positive CRC +   will occur during the life time of LOFAR 2.0. Design decisions: -- Use CHAN (32b), Sync & BSN (64b), DATA (>= 1 b), ERR (32b), CRC (32b) to transport data between FPGAs - without false positive CRCs during the lifetime of LOFAR 2.0 to garantuee that only correct packets - enter the FPGA internal processing. The internal FPGA processing must be robust to lost packets, but - it does not have to be robust against corrupted packets (wrong contents, wrong length). +- Use CHAN (32b), Sync & BSN (64b), DATA (>= 1 b), ERR (32b), CRC (32b) to transport data between + FPGAs without false positive CRCs during the lifetime of LOFAR 2.0 to guarantee that only + correct packets enter the FPGA internal processing. The internal FPGA processing must be robust + to lost packets, but it does not have to be robust against corrupted packets (wrong contents, + wrong length). ******************************************************************************* -* dp_validate_crc +* dp_validate_crc (dp_store_and_forward) * - Validate (geldig verklaren) CRC and store-and-forward or store-and-discard this packet ******************************************************************************* The Ethernet/DP packet has two CRC checksums in the packet tail: -- the Ethernet CRC is calculated by the 1GbE MAC +- the Ethernet CRC is calculated by the 1GbE MAC and reported via the sosi.err field - the DP packet CRC is calculated by the dp_packet_dec. -The packet needs to be stored before it can be forwarded or discarded, because the entire packet is needed -to calculate and verify the CRC. 
The CRC results are reported via the sosi.err field at the end of packet -(eop). The dp_validate_crc forwards the packet when the CRC is oke and discards the packet when the CRC is -wrong. +The CRC information is in the packet tail. Therefore the packet needs to be stored, before it can +be validated, because the entire packet is needed to calculate and verify the CRC. Dependent on +the validation outcome the packet is either forwarded or discarded. The CRC results are reported +via the sosi.err field at the end of packet (eop). The dp_validate_crc forwards the packet when +the sosi.err CRC is correct and discards the packet when the CRC is wrong. + +The dp_validate_crc uses dp_store_and_forward. The decision is known when the eop is received and +must then be applied at the sop, to either release the block or discard it. + +- Rx ETH MAC puts ETH CRC result in sosi.err at eop +- Rx DP decode puts DP CRC result in sosi.err at eop +- dp_validate_crc stores the packet and forwards if it has no error at the eop @@ -93,32 +107,35 @@ wrong. * - Validate (geldig verklaren) BSN at Rx sync and pass on or discard packets until next Rx sync ******************************************************************************* -The DP packet has a sync and BSN field in the packet header. This field is at the start of the packet (sop), -so it can be verified while the packet arrives. The Rx BSN at the Rx sync in the received packet should be -equal to the local Station BSN at the local sync. If the Rx sync BSN and the local sync BSN are: +The DP packet has a sync and BSN field in the packet header. This field is at the start of the +packet (sop), so it can be verified while the packet arrives. The Rx BSN at the Rx sync in the +received packet should be equal to the local Station BSN at the local sync. 
If the Rx sync BSN +and the local sync BSN are: - not equal, then discard all subsequent blocks until the Rx sync BSN is equal again, - equal, then pass on all subsequent blocks until the next Rx sync -The assumption is that if the BSN at sync is wrong, then the block processing at this node or at the remote -node has not been started properly, so then subsequent blocks will have wrong BSN also. If the BSN at -sync is oke, all nodes have been started properly abd then the BSN for all subsequent blocks in the sync -interval will be correct too. The sync and BSN value are not corrupted, because they are determined inside -the local and remote FPGA (so error free, because the logic is error free) and the remote BSN is -transported using a CRC (so error free, because the CRC detects all errors). +The assumption is that if the BSN at sync is wrong, then the block processing at this node or at +the remote node has not been started properly, so then subsequent blocks will have wrong BSN also. +If the BSN at sync is correct, all nodes have been started properly and then the BSN for all +subsequent blocks in the sync interval will be correct too. The sync and BSN value are not +corrupted, because they are determined inside the local and remote FPGA (so error free, because +the logic is error free) and the remote BSN is transported using a CRC (so error free, because +the CRC detects all errors). -The initial state to discard or pass on block is don't care, because the assumption is that the block -processing was (re)started properly on all nodes. At power up, choose to initially pass on packets. -If the packet with the Rx sync and BSN is lost, then the last decision to discard or pass on packets -remains, because it is still valid. +The initial state to discard or pass on block is don't care, because the assumption is that the +block processing was (re)started properly on all nodes. At power up, choose to initially pass +on packets. 
If the packet with the Rx sync and BSN is lost, then the last decision to discard +or pass on packets remains, because it is still valid. -The dp_validate_bsn_at_sync function verifies the entire 64 bit sync and BSN in an Rx packet. For local and -remote inputs the BSN can only differ by a limited number dependent on the latency differences between the -different inputs. Therefore if the input Rx BSN at sync matches the local Station BSN, then for the -BSN aligner that aligns the inputs based on the BSN it is sufficient to only use a fraction of the BSN. -Using the fraction of the BSN as index is suffivient to distinguish between blocks within the maximum BSN -latency. If the fraction N is a power of 2 , then only the log2(N) LSbits of the BSN need to be compared -to ensure that all inputs have the same 64 bit sync and BSN. +The dp_validate_bsn_at_sync function verifies the entire 64 bit sync and BSN in an Rx packet. +For local and remote inputs the BSN can only differ by a limited number dependent on the +latency differences between the different inputs. Therefore if the input Rx BSN at sync matches +the local Station BSN, then for the BSN aligner that aligns the inputs based on the BSN it is +sufficient to only use a fraction of the BSN. Using the fraction of the BSN as index is +sufficient to distinguish between blocks within the maximum BSN latency. If the fraction N is +a power of 2 , then only the log2(N) LSbits of the BSN need to be compared to ensure that all +inputs have the same 64 bit sync and BSN. @@ -134,26 +151,29 @@ Assumptions: - Usage schemes: . N = 2 inputs aligner with 1 local data and 1 remote data . N > 2 inputs aligner with 1 local data and N-1 remote data - . N >=2 inputs aligner with 0 local data and N remote data (not used on ring, but was used in APERTIF) + . N >=2 inputs aligner with 0 local data and N remote data (not used on ring, but was used in + APERTIF) . 
Treat all inputs equal, so no special role for a local input to suit more general usage -- The local sync and BSN sources on all FPGAs are synchronous, to avoid additional BSN latency between inputs. +- The local sync and BSN sources on all FPGAs are synchronous, to avoid additional BSN latency + between inputs. - Static input enable or disable via M&C - it is possible to enable or disable any combination of inputs - if all inputs are disabled then the output stops. - - if the input enable or disable setting is changed, then the BSN aligner restarts trying to achieve alignment. + - if the input enable or disable setting is changed, then the BSN aligner restarts trying to + achieve alignment. - disabled inputs are output with zero or flagged data - - for the ring with 1 local and 1 remote input the static input enable/disable supports the align modes: + - for the ring with 1 local and 1 remote input the static input enable/disable supports the + align modes: . disabled, . local only, . remote only, . local and remote - Input latency: - . the input latencies are fixed by design, so inputs have a maximum BSN latency g_bsn_latency that is fixed - and that does not have to be programmable via M&C. - . If all hops on the ring are active then the total latency will be (N-1)*(d + 1) where d is the transport - latency of each hop and 1 is due to store-and-forward at each node. Typically the total transport latency - on the ring is (N-1)*d < 1, so less than one block period. The total ring latency is covered by - g_bsn_latency > (N-1)*(d + 1). + . the input latencies are fixed by design, so inputs have a maximum BSN latency g_bsn_latency + that is fixed and that does not have to be programmable via M&C. + . If all hops on the ring are active then the total latency will be (N-1)*t_hop, where t_hop is + the transport latency of each hop. The total transport latency on the ring is (N-1)*t_hop. + The total ring latency is covered by g_bsn_latency > (N-1)*t_hop. 
- Lost input blocks: . accept that the corresponding output is lost too, or output filler block to replace lost block . should not cause subsequent blocks to get lost too @@ -167,22 +187,23 @@ Assumptions: . smoothen bursts (only an issue with remote drive output) . provide output throttling (requires output FIFOs or data blocks that have sufficient gaps) - Stopped input: - . If after some block periods (e.g. g_bsn_latency) there is no more block pending at any input, then the - output stops and the BSN aligner should restart trying to achieve alignment. + . If after some block periods (e.g. g_bsn_latency) there is no more block pending at any input, + then the output stops and the BSN aligner should restart trying to achieve alignment. Notes: - In LOFAR and APERTIF the BSN aligner does loose more blocks due to input flush and realign -- a BSN aligner can align at any BSN, using a sync aligner that can only align at the sync, would cause - loosing an entire sync interval to realign, which is not acceptable +- a BSN aligner can align at any BSN, using a sync aligner that can only align at the sync, would + cause loosing an entire sync interval to realign, which is not acceptable - in APERTIF the sync_checker looses entire sync intervals to ensure filled sync intervals -- In LOFAR and APERTIF the output is driven by the remote input to add minimal latency, however this - results in loosing more packets and having to realign if input packets get lost. -- In dp_bsn_align the artifical local data stream was used to ensure that the output block size was correct, - by using extra CRC checking (ETH CRC and DP CRC) and store and forward in Rx it is already certain that only - correct input packets arrive at the BSN aligner input. Therefore an artifical local data stream is not needed. +- In LOFAR and APERTIF the output is driven by the remote input to add minimal latency, however + this results in loosing more packets and having to realign if input packets get lost. 
+- In dp_bsn_align the artifical local data stream was used to ensure that the output block size + was correct, by using extra CRC checking (ETH CRC and DP CRC) and store and forward in Rx it is + already certain that only correct input packets arrive at the BSN aligner input. Therefore an + artifical local data stream is not needed. Design options: @@ -190,40 +211,44 @@ Design options: . Rely on next received packet: - check per input that the BSN increments +1 - requires a timeout or overflow detection on other inputs to detect a burst of lost packets - - after a burst of lost packets, typically the output cannot catch up anymore, so then the BSN aligner - needs to flush its input buffer and restart. + - after a burst of lost packets, typically the output cannot catch up anymore, so then the BSN + aligner needs to flush its input buffer and restart. . Per packet using a local block reference. - The local block reference is offset by at least g_bsn_latency relative to the local BSN source, to - ensure that all inputs should have a new block pending for output. This is possible, because the input - latencies are static and within a fixed range: + The local block reference is offset by at least g_bsn_latency relative to the local BSN + source, to ensure that all inputs should have a new block pending for output. This is + possible, because the input latencies are static and within a fixed range: - in circular buffer the Wr flag for the lost block remains unset - in FIFO by no pending input or pending input with higher BSN then current output BSN ==> Design decision: - - Use local block reference to define when to detect lost packets, because one lost block should not - cause subsequent blocks to get lost too. + - Use local block reference to define when to detect lost packets, because one lost block + should not cause subsequent blocks to get lost too. - Output driven by remote input block arrival or by local block reference . 
in case of 1 remote input, the remote input does not need a FIFO if it drives the output . in case of > 1 remote input, then the remote inputs also requires FIFOs - . using local input increases the latency from remote input to output, because fixed to the T_sub grid + . using local input increases the latency from remote input to output, because fixed to the + T_sub grid . using local input at T_sub grid avoids bursts, this can also be handled using flow control - . with local input driving the output the assumption is that if the local input has M packets, then all remote - inputs will have delivered at least one frame, so there should be a sop pending from all. - . if there is no local input, then an artifical local input can be derived when BSN is equal on all enabled remote inputs. - . if remote input is lost, then entire output is lost if remote drives output, because there is not enough spare time - to still output the other input packets - . For remote driven output a slot can be output when for all active inputs there is a block. However if one or - a series of packets got lost, then the other inputs will overflow. Hence remote driven output needs a timeout - to keep the output running, so a form of local driven output. Hence to avoid additional packet loss on other - inputs or of subsequent packets in time it is necessary to have a local driven output. Therefore using a remote - driven output is not feasible. + . with local input driving the output the assumption is that if the local input has M packets, + then all remote inputs will have delivered at least one frame, so there should be a sop + pending from all. + . if there is no local input, then an artifical local input can be derived when BSN is equal on + all enabled remote inputs. + . if remote input is lost, then entire output is lost if remote drives output, because there is + not enough spare time to still output the other input packets + . 
For remote driven output a slot can be output when for all active inputs there is a block. + However if one or a series of packets got lost, then the other inputs will overflow. Hence + remote driven output needs a timeout to keep the output running, so a form of local driven + output. Hence to avoid additional packet loss on other inputs or of subsequent packets in time + it is necessary to have a local driven output. Therefore using a remote driven output is not + feasible. ==> Design decision: - - Use local block reference to define when aligned blocks should be output, because one lost block should - not cause subsequent blocks to get lost too, which is more important then adding minimal latency and - potentially saving BSN aligner input buffer memory. + - Use local block reference to define when aligned blocks should be output, because one lost + block should not cause subsequent blocks to get lost too, which is more important then + adding minimal latency and potentially saving BSN aligner input buffer memory. - Generation of local block reference to define the output pace: @@ -244,7 +269,7 @@ Design options: - Filler data insertion . Whether to drop a block or to replace it by a filler block depends on the application - for BF drop all inputs, because beam is affected - - for XC insert filler data, because visibilities of active inputs are still oke. + - for XC insert filler data, because visibilities of active inputs are still correct. - for the output via the Network insert filler data to keep the output at the nominal rate, such that the destination can distinguish between data blocks that got lost inside Station and packet loss on the Network. @@ -470,7 +495,7 @@ Design options: because the Station BSN is 50 bit. -. Cicrular buffers on CEP +. Circular buffers on CEP On CEP the beamlet data is written into a circular buffer based on the time stamp. A flag indicates whether data in the circular buffer is valid. 
The size of the circular buffer is in the order of hundreds of ms to cover the distance latency of the international stations. An array of tupples lists the lenght of continuous blocks in the circular buffer, and diff --git a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt index 90e1fe9030..b7996af32d 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt @@ -12,9 +12,10 @@ The oversampling increases the processing rate and data rate by a factor R_os. T Processing capacity per subband period: Assume the processing for critically sampled filterbank runs at 200 MHz and for oversampled subbands it will run at R_os * 200 MHz. For R_os = 1.28 this requires processing at >= 256 MHz. -This means that the processing has N_fft = 1024 clock cycles avaiable per subband period T_sub, -independent of R_os. In this way if the processing for the critically sampled subbands fits -within N_clk = N_fft = 1024 clock cycles, then it will also fit for the oversampled subbands. +This means that the processing has N_clk = N_fft = 1024 clock cycles avaiable per subband +period T_sub, independent of R_os. In this way if the processing for the critically sampled +subbands fits within N_clk = N_fft = 1024 clock cycles, then it will also fit for the +oversampled subbands. IO capacity per 10GbE lane: The IO data rate on the ring increases with the oversampling factor R_os. For oversampled data @@ -163,15 +164,41 @@ application. Ring latency: -The latency of 1 hop is about 0.2 us. The time to transmit one Ethernet frame of 1500 octets at -10Gbps is about 1.2 us and a jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for -packets >~ 300 octets the ring latency is dominated by the store-and-forward routing at each node. -The 10GbE Ethernet MAC uses 64 bit data. At 200 MHz this can achieve 64 * 0.2 = 12.8 Gbps. 
Hence -if the processing operates without data valid gaps, then the Ethernet transmit will not run empty -during a payload. Therefore it is not necessary to use a fill FIFO, which would add to the ring -latency. For a packet that travels the entire ring the latency is then about (N-1) * T_sub and -the corresponding FIFO depth to align the local data with this remote data is (N-1) * packet size. - +The RSP boards use wormhole routing on the ring. The latency of 1 hop between RSP boards is about +0.2 us. The time to transmit one Ethernet frame of 1500 octets at 10Gbps is about 1.2 us and a +jumbo frame of 6400 octets takes about 5.12 us (= T_sub). Hence for packets >~ 300 octets the +LOFAR2.0 SDP ring latency will be dominated by the t_store for the store-and-forward routing +at each node. The processing uses 2 * 18 bit beamlet data. The 10GbE Ethernet MAC uses 64 bit +data. The repacking of the beamlet data into payload data causes gaps that need to be removed +using a fill FIFO, before the packet can be transmitted. This fill FIFO adds t_fill to the +latency. Other latency is caused by the pipelining delay t_pipe in the FPGA and the propagation +delay t_prop. The travel latency t_travel = t_pipe + t_fill + t_prop + t_store of one hop +consists of: + + Delay Where +- t_pipe Tx node : pipelining processing +- t_fill Tx node : fill the Tx FIFO sufficiently to ensure Tx of complete packet +- t_prop Lane : propagation on lane +- t_store Rx node : store and forward or discard + +Packets are transmitted at a local block grid. The assumption is that at the local block grid the +remote Rx packet has been received; if not, then it is lost and will not arrive later. The local +block grid is set by the BSN aligner when it first achieves input alignment. This BSN alignment +is based on the t_travel latency that occurred between this node and the previous node. The +t_travel latency can vary slightly in time and vary slightly between hops, due to the clock +domain crossings. 
The variation in t_travel will be small, because only one kind of packets are +transported per lane, so the traffic is not influenced by other streams. To account for the +variation in t_travel a margin is needed to start a fixed local block grid with period T_sub +for the aligned packets. The t_margin ensures that at the block grid the expected remote packet +must have been received, and if not then it was lost. +- t_margin Rx node : margin to align inputs + +The actual latency per hop is t_hop = t_travel + t_margin. The variation in t_travel is small and +t_margin is fixed by design, so t_hop is about the same for all hops in the ring and for all +block periods in time. Hence the total latency along N nodes in the ring is then (N-1)*t_hop. +The dominant latencies in t_hop are t_fill and t_store. If t_hop < T_sub, then each hop will +require one block buffering of the local input to be able to align it to the remote input. The +total buffering for the local input is then (N-1)*P_packet. OSI 4 Transport layer: Use UDP/IP/ETH or only ETH on the ring: @@ -282,12 +309,14 @@ The ring function has the following sub functions: - Align packets for processing (use filler data on inputs with lost packets) -Ring access schemes: +Ring access and transport schemes: -- 1) start node sends packet to end node, intermediate nodes modify the packet. -- 2a) each node starts sending its packets to an end node, intermediate nodes pass on the packet -- 2b) each node starts sending its packets to an end node, intermediate nodes pass on the packet - and use the packet (= multi cast) +- 1) ring combiner scheme: start node sends packet to end node, intermediate nodes modify the + packet (= combine local with remote). 
+- 2a) ring endcast scheme: each node starts sending its packets to an end node (= end cast), + intermediate nodes pass on the packet +- 2b) ring multicast scheme: each node starts sending its packets to an end node, intermediate + nodes pass on the packet and use the packet (= multi cast) If both scheme 1 and 2 are suitable, then scheme 1 typically yields a larger payload, because it reserves slots for all nodes, whereas the payload for scheme 2 only contains data from one node. @@ -396,23 +425,63 @@ end node, so no extra logic is needed. Ring adder payload processing: The full band station beam has S_sub_bf = 488 beamlets per polarization, so in total there are -N_pol * S_sub_bf = 2 * 488 = 976 complex beamlets per subband period of N_fft = 1024 cycles @ -200 MHz. For an oversampled filterbank with R_os > 1 the processing rate is increased to -200 * R_os MHz, so there are still N_fft = 1024 cycles available to process 976 beamlets. The ring -adder adds the local beamlet sum to the received beamlet sum and passes on the result. The beamlet -sum is received as a packet with 64 bit packed data at 156.25 MHz (64 * 156.25M = 10G). The 976 -beamlets fit in 976 * 18b * 2 / 64b = 549 64b words. The packet is processed at 200 * R_os MHz. - - . from 10GbE --> - . Rx packet 64b @ 156MHz --> Rx FIFO to dp_clk domain --> - . Rx packet 64b @ 200MHz --> DP/ETH decode to discard or extract payload of 549 words--> - . Rx payload 64b @ 200MHz --> repack 549 words to 976 beamlets --> - . Align remote and local beamlets --> - . Sum remote and local beamlets --> repack 976 beamlets to 549 words --> - . Tx payload 64b @ 200MHz --> DP/ETH encode to add header and tail --> - . Tx packet 64b @ 200MHz --> Tx FIFO to tx_clk domain --> - . Tx packet 64b @ 156MHz --> - . to 10GbE +N_pol * S_sub_bf = 2 * 488 = 976 complex beamlets per subband period of N_clk = 1024 cycles. The +ring adder adds the local beamlet sum to the received beamlet sum and passes on the result. 
+The beamlet sum is received as a packet with 64 bit packed data at 156.25 MHz (64 * 156.25M = +10G). The 976 beamlets fit in 976 * 18b * 2 / 64b = 549 64b words. The packet header and tail +overhead is P_overhead = 60 octets, so about 8 64b words and thus the effective packet size is +P_packet = 8 + 549 = 557 64b words. + + . from 10GbE MAC --> + . @ 156MHz Rx packet 64b --> Rx FIFO from Rx domain to DP domain --> + . @ 200MHz Rx packet 64b --> DP/ETH decode to discard or extract payload of 549 words--> + . @ 200MHz Rx payload 64b --> repack 549 words to 976 beamlets --> + . @ 200MHz Align remote and local beamlets --> + . @ 200MHz Sum remote and local beamlets --> repack 976 beamlets to 549 words --> + . @ 200MHz Tx payload 64b --> DP/ETH encode to add header and tail --> + . @ 200MHz Tx packet 64b --> Tx fill FIFO from DP domain to Tx domain --> + . @ 156MHz Tx packet 64b --> + . to 10GbE MAC + + +The DP/ETH encoding and decoding is done in the DP domain, because the Rx meta information is +needed there and the Tx meta information is available there. The DP/ETH decoding first validates +the ETH CRC (that was checked by the 10GbE MAC) and the DP CRC (that is checked by the DP +decoding). Both CRCs are validated using the same input store-and-forward buffer, so that the Rx +packet only needs to be buffered once. If both CRCs are correct, then the Rx payload is released +from the store-and-forward buffer, else it is discarded. The released payload is then repacked +to obtain the remote beamlets. The remote beamlets are then aligned to the local beamlets and +summed. The summed beamlets are repacked to 64b data and then DP/ETH encoded. The DP encoding +adds the DP CRC and the 10GbE MAC will add the ETH CRC. + +In the DP domain, at 200 * R_os MHz, there are N_clk = 1024 cycles available to process the +packet. 
The clock domain crossing from Rx 156M to DP 200M causes gaps in the data, but these gaps +are not sufficient (or equivalently, the DP clock is not fast enough) to perform the Rx repacking +from 64b to 2*18b beamlets, because r = 200/156.25 * 36/64 = 0.72 < 1. Hence effectively the DP +rate is a factor r slower than the Rx rate. Therefore the Rx repacking needs to apply +backpressure that can be accepted by the Rx FIFO. When the Rx packet has been received, the DP +processing will have accepted a fraction r of it, so the Rx FIFO will fill up to about +(1-r) * P_packet = (1 - 0.72) * 557 = 156 64b words and then run empty again after the last 64b +word was received. The processing of the beamlets occurs without gaps. However the Tx repacking +from 2*18b beamlets to 64b does cause gaps, and these gaps in the DP domain are too many (or +equivalently, the DP clock is too slow) to perform continuous packet transmission in the Tx +domain, because r < 1. Therefore the clock domain crossing Tx FIFO needs to use a fill FIFO. +The Tx FIFO will first need to be filled by about (1-r) * P_packet = (1 - 0.72) * 557 = 156 64b +words, before the Tx can start to ensure that the packet is transmitted without gaps. +Note that (1/r - 1)/(1/r) = (1-r)/1 = 1-r, so defining r or 1/r does not matter. + +The travel latency per hop t_travel can be expressed in DP clock cycles at 200 MHz by: +- The latency of one hop between RSP boards is a good estimate of the sum of the pipelining and + propagation, so t_pipe + t_prop ~= 0.2 us, or 40 clock cycles. +- The latency of the store-and-forward buffer is t_store ~= 557 / 156.25M = 3.6 us, or 713 clock + cycles. +- The latency of the Tx fill FIFO is t_fill ~= 156 / 156.25M = 1.0 us, or 200 clock cycles. +Hence t_travel = 40 + 713 + 200 = 953 clock cycles, so using t_margin = 1024 - 953 = 71 would +allow the beamformer ring to only have to buffer one block per hop. 
The t_margin must not be set +too small, because then an Rx packet may be considered lost, while it is still just about to +arrive. The minimal local input buffering occurs for t_margin = 0. Instead of N-1 blocks the +buffer then needs to fit (N-1) * 953 / 1024 blocks. For N = 16 this would save one block, which +is only 1 - 953/1024 = 7%, so negligible. ? Does align belong to ring or to beamlet ring adder? @@ -425,21 +494,29 @@ beamlets fit in 976 * 18b * 2 / 64b = 549 64b words. The packet is processed at adds (so does not have BF weigths like the local BF). -Local beamlet sums FIFO size: +BF BSN aligner input buffer size: The local subband data needs to be buffered until the beamlet sum arrives. The size of the buffer -is determined by last node, because then the beamlet sum has travelled N-1 hops. For each hop the -packet is delayed by: - - packet encoding - - packet transport over the ring -After each hop the packet is delayed by: - - store-and-forward to be able to check the CRC - - packet decoding - - packet processing -The store-and-forward causes a latency of one block period (T_sub) per hop and is the dominant -factor in the latency. During this latency N-1 local blocks need to be buffered. Assume that the -processing and transport delays are shorter than one block period, so buffering one extra local -block is sufficient to compensate it. Per PN this yields a FIFO size of N_pol * S_sub_bf * N * -N_complex * W_subband = 2 * 488 * 16 * 2 * 18 = 562176 bit, which takes about 32 M20k block RAMs. +is determined by the last node, because then the beamlet sum has travelled N-1 hops. Assume that +t_hop = T_sub, so buffering one local block per hop and one extra block is sufficient to +compensate for the latency along the ring. This yields an input buffer size of K = N = 16 +blocks, so K * N_pol * S_sub_bf * N_complex * W_subband = 16 * 2 * 488 * 2 * 18 = 562176 bit, +which takes about 32 M20k block RAMs. 
The BSN aligner for the beamlet ring adder has two inputs, +so it will use 2 * 32 = 64 M20k block RAMs. + + +What if a packet gets lost? +The local beamlets cannot get lost, but remote packets may get lost. The BSN aligner will replace +lost remote packets with filler packets that are flagged. The beamlets in the filler packets +contain zero data, so in the beamformer they do not contribute to the beamlet sum and the beamlet +sum that is passed on only contains the local beamlet values. Hence a lost packet results in an +incomplete station beam. The incomplete beamlet sum is passed on along the ring, to preserve the +nominal line rate at the subsequent hops (i.e. to avoid propagation of the lost packet). However, +the incomplete beamlet sum must be flagged via a bit in the DP channel field. At the final node +the flagged incomplete beamlet sum is sent to CEP, to preserve the nominal line rate. At CEP the +flagged incomplete beamlet data has to be discarded, because the shape and gain of the incomplete +beam then differs, dependent on where on the ring the packet got lost. 
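The beamformer ring numbers derived in this section (repacking ratio r, Tx fill level, t_travel, t_margin and the BSN aligner input buffer size) can be cross-checked with a short Python sketch. This is only a sanity check under assumptions: the constant and function names are made up for illustration, and rounding up to whole 64b words and whole clock cycles is an assumption chosen because it reproduces the rounded values quoted in the text.

```python
import math

# Values quoted in the text; the names are hypothetical, for illustration only.
F_DP_MHZ = 200.0          # DP clock for R_os = 1
F_MAC_MHZ = 156.25        # 10GbE MAC 64b word clock
N_CLK = 1024              # DP clock cycles per subband period T_sub
N_NODES = 16              # N nodes on the ring
N_BEAMLETS = 2 * 488      # N_pol * S_sub_bf
W_BEAMLET = 2 * 18        # N_complex * W_subband, bits per complex beamlet
P_OVERHEAD_OCTETS = 60    # packet header and tail overhead
T_PIPE_PROP_CLK = 40      # t_pipe + t_prop ~= 0.2 us at 200 MHz


def mac_words_to_dp_clk(words):
    """Time to move `words` 64b words at the MAC rate, in DP clock cycles (rounded up)."""
    return math.ceil(words / F_MAC_MHZ * F_DP_MHZ)


payload_words = math.ceil(N_BEAMLETS * W_BEAMLET / 64)    # 549
overhead_words = math.ceil(P_OVERHEAD_OCTETS / 8)         # about 8
packet_words = payload_words + overhead_words             # P_packet = 557

r = F_DP_MHZ / F_MAC_MHZ * W_BEAMLET / 64                 # 0.72 < 1
fill_words = math.ceil((1 - r) * packet_words)            # about 156

t_store = mac_words_to_dp_clk(packet_words)               # store-and-forward
t_fill = mac_words_to_dp_clk(fill_words)                  # Tx fill FIFO
t_travel = T_PIPE_PROP_CLK + t_store + t_fill
t_margin = N_CLK - t_travel

# One local block per hop plus one extra block, so K = N blocks.
buffer_bits = N_NODES * N_BEAMLETS * W_BEAMLET

print(packet_words, fill_words, t_store, t_fill, t_travel, t_margin, buffer_bits)
```

Changing N_NODES or W_BEAMLET in this sketch shows how the buffer size scales, but the 32 M20k estimate per input in the text still has to come from the actual FPGA memory mapping.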
+ + Ring modes: - off @@ -452,33 +529,34 @@ With dp_bsn_align_v2 all these modes are supported by enabling/disabling the cor The beamformer function has the following sub functions: - "Beamlet subband select" : Select S_sub_bf = 488 subbands per signal input -- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for - S_pn = 12 signal inputs +- "Local beamformer" : Form N_pol * S_sub_bf = 2 * 488 = 976 local beamlet sums for S_pn = 12 + signal inputs - "Beamlet ring adder" : - if start node: - - Encode local beamlet sums packet to ring - else: - - Buffer the local beamlet sums for ~= N subband intervals - - Decode remote beamlet sums packet from ring - - Align remote beamlet sums packet and local beamlet sums packet - - Add local beamlet sums to remote beamlet sums packet - if transit node: - - Encode beamlet sums packet to ring - else: - - "Beamlet data output" : On output node scale and output final beamlet sums + if start node: + - Encode local beamlet sums packet to ring + else: + - Buffer the local beamlet sums for ~= N subband intervals + - Decode remote beamlet sums packet from ring + - Align remote beamlet sums packet and local beamlet sums packet + - Add local beamlet sums to remote beamlet sums packet + if transit node: + - Encode beamlet sums packet to ring + else: + - "Beamlet data output" : On output node scale and output final beamlet sums - "Beamlet statistics (BST)": Calculate BST for beamlet sums, output node has final BST + ******************************************************************************* * Subband Correlator ******************************************************************************* With transport scheme 1 crosslets from different source nodes are combined into one packet. Scheme 2b packs only local crosslets into a packet. 
Compared to scheme 1, scheme 2b: -- treats the local crosslets and remote independently -- has small payload and thus more packet overhead, but the load still fits on a lane +- treats the local crosslets and remote crosslets independently +- has small payload and thus more packet overhead, but the packet load still fits on a lane - has small payload that can be enlarged by transporting more local crosslets, to support - a subband correlator with N_crosslets > 1. + a subband correlator with N_crosslets > 1 per integration interval. Design decision: Use transport scheme 2b with N/2 hops where every node sends its local crosslets N/2 hops, @@ -488,12 +566,23 @@ Design decision: Number of square correlator cells per PN: There are S_pn = 12 local crosslets. A packet contains S_pn = 12 remote crosslets. There are N/2 remote crosslet packets. The local crosslets have to be correlated with the local crosslets and -with each of the remote crosslet packets. The correlation with the local crosslets is a square -matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square correlator -cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other local-remote -square correlator cells the efficiency is 100 %. With N = 16 PN for LBA there are N/2 = 8 remote -crosslet packets. Hence together with the local crosslet visibilities this yields -X_pn = (N/2 + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. +with each of the S_lba - S_pn remote crosslet packets. The correlation with the local crosslets +is a square matrix that yields X_sq = S_pn * S_pn = 144 visibilities. For the local-local square +correlator cell the efficiency is (S_pn * (S_pn+1)) / 2 / X_sq = 54%, but for the N/2 other +local-remote square correlator cells the efficiency is 100 %. With N = 16 PN for LBA there are +N/2 = 8 remote crosslet packets. 
Hence together with the local crosslet visibilities this yields +X_pn = (floor(N/2) + 1) * X_sq = (8 + 1) * 144 = 1296 visibilities per PN. In total the subband +correlator calculates N * X_pn = 16 * 1296 = 20736 visibilities. There are +S_lba * (S_lba + 1)/2 = 192 * 193 / 2 = 18528 unique visibilities. The difference 20736 - 18528 += 2208 is because: + +. for any N the N * S_pn*(S_pn-1)/2 = 16 * 12*11/2 = 1056 local-local visibilities are calculated + twice +. for even N floor(N/2) * S_pn*S_pn = 16/2 * 12*12 = 1152 local-remote visibilities are + calculated twice. For odd N the local-remote visibilities are only calculated once. + +and to check 1056 + 1152 = 2208 indeed. + Number of multipliers per crosslet: @@ -509,119 +598,86 @@ than 1 subband per integration interval, so N_crosslets > 1. Design decision: Use 1 + N/2 parallel correlator cells, for the local-local visibilities and for the local- - remote visibilitie for each remote source. + remote visibilities for each remote source. What is the crosslet packet size? With S_pn = 12 signal inputs per PN and one crosslet per signal input there are 12 crosslets per -packet. A crosslet is a W_crosslet = 16 bit complex value, so 12 * 4 = 48 octets payload, so the -effective packet size is p_packet = 60 + 48 = 108 octets. The relative packet overhead for single -crosslet payloads is P_overhead / P_packet = 60 / 108 = 55 %. +packet. A crosslet is a W_crosslet = 16 bit complex value, so P_payload = 12 * 4 = 48 octets +payload, so the effective packet size is P_packet = P_overhead + P_payload = 60 + 48 = 108 octets. +The relative packet overhead for single crosslet payloads is P_overhead / P_packet = 60 / 108 = +55 %. Note that P_overhead_dp + P_payload = 20 + 48 = 68 octets still meets the minimum Ethernet +payload size requirement of 46 octets. Maximum number of crosslets per lane: There are f_sub = 195312.5 subbands per s, and the packets have to travel N/2 hops. 
This yields a packet load of P_packet * f_sub * N/2 = (108 * 8b) * 195312.5 * 16 / 2 = 1.35 Gbps. The data -load of only the payload data is payload size * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 = +load of only the payload data is P_payload * f_sub * N/2 = (48 * 8b) * 195312.5 * 16 / 2 = 0.6 Gbps. Hence the small packet size causes a large packet overhead, but is still acceptable, -since it is < L_lane = 7.8125 Gbps so it fits on a single 10G lane of the ring. +since it is < L_lane = 7.8125 Gbps, so it fits on a single 10G lane of the ring. Multiple local crosslets could be transported via seperate packets, a lane can then fit about 7.8125 / 1.35 ~= 5 different crosslets. Packing the local crosslets into a single payload reduces the packet overhead. The maximum number of crosslets per packet follows from -(P_overhead + X * 48 * 8b) * f_sub * N/2 < L_lane. For N = 16 this yields X ~= -(7.8125 Gbps / (16/2) / 195312.5 - 60) / (48 * 8b) = 12. With 12 crosslets the payload size is -16 * 48 = 576 and the effective packet size is P_packet = 60 + 576 = 636 octets. The relative +(P_overhead + x * P_payload * 8b) * f_sub * N/2 < L_lane. For N = 16 this yields x ~= +(7.8125 Gbps / (16/2) / 195312.5 - 60) / (48 * 8b) = 12. With x = 12 crosslets the payload size +is 12 * 48 = 576 and the effective packet size is P_packet = 60 + 576 = 636 octets. The relative packet overhead for multi crosslet payloads is P_overhead / P_packet = 60 / 636 ~= 9.4%. The -packet load for multi crosslet payloads is (636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps < -L_lane = 7.8125 Gbps, so this just not fits on a 10GbE lane, due to the still significant packet -overhead. Using X = 11 instead of 12 crosslets per packet yields a total crosslet packet load -per lane of ((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on a lane. 
+packet load for multi crosslet payloads is (636 * 8b) * 195312.5 * 16/2 = 7.95 Gbps > +L_lane = 7.8125 Gbps, so this just does not fit on a 10GbE lane, due to the still significant +packet overhead. Using x = 11 instead of x = 12 crosslets per packet yields a total crosslet +packet load per lane of ((60 + 11 * 48) * 8b) * 195312.5 * 16/2 = 7.35 Gbps, which does fit on +a lane. Design decision: - Pack local crosslets into a single payload if N_crosslets > 1, because then teh packet overhead - is much reduced to support transporting more crosslets per lane (11 instead of 5). + Pack local crosslets into a single payload if N_crosslets > 1, because then the relative packet + overhead is much reduced to support transporting more crosslets per lane (11 instead of 5). Maximum number of crosslets per correlator cell: -A X_pn correlator cell can correlate N_fft / X_sq = 1024 / 144 = 7 different crosslets frequencies. +An X_pn correlator cell can correlate N_clk / X_sq = 1024 / 144 = 7 different crosslets frequencies. With N = 16 for LBA, there need to be N/2 + 1 = 9 of these X_pn correlator cells in parallel. One -X_pn correlates the local-local crosslets and the other N/2 X_pn correlates the local-remote -crosslets. These 9 X_pn in parallel can correlate up to 7 crosslets. The link can transport -maximum 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO capacity of 1 -10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets. The crosslet data rate -on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a utilization of 4.95 / -7.8125 = 63 %. Another set of 9 X_pn could be used to correlate the remaining 11- 7 = 4 crosslets -that can be transported via the ring. However, if more than N_crosslet = 7 crosslets need to be -correlated in parallel per integration, then it is easier to allocate an extra lane and to -instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in total. 
+X_pn correlates the local-local crosslets and the other N/2 = 8 X_pn correlate the local-remote +crosslets. These 9 X_pn in parallel can correlate up to 7 different crosslets. The link can +transport maximum 11 crosslets. Hence the processing capacity of 9 X_pn is less than the IO +capacity of one 10GbE lane, therefore 9 X_pn in parallel can correlate 7 different crosslets. +The crosslet data rate on a lane is then ((60 + 7 * 48) * 8b) * 195312.5 * 16/2 = 4.95 Gbps, so a +utilization of 4.95 / 7.8125 = 63 %. Another set of 9 X_pn could be used to correlate the remaining +11 - 7 = 4 crosslets that can be transported via that lane. However, if more than N_crosslet = 7 +crosslets need to be correlated in parallel per integration interval, then it is easier to allocate +an extra lane and to instantiate an extra set of 9 X_pn to correlate 14 crosslets in parallel in +total. One X_pn takes one complex multiplier. For N_crosslets = 1 crosslet per integration interval using -N/2+1 = 9 X_pn uses only 144 / 1024 = 14% of the processing resources. However this is acceptable +1 + N/2 = 9 X_pn uses only 144 / 1024 = 14% of the processing resources. However this is acceptable because: - the FPGA has sufficient multipliers - it provides a clear design - the spare capacity can be used to process more crosslets per integration interval Design decision: - Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or upto 7 + Use 1 + N/2 = 9 parallel correlator cells to correlate N_crosslets = 1 crosslet, or upto 7 crosslets in parallel, per integration interval. - Send more than one time slot per packet? To reduce the relative packet overhead for single crosslet XC it is an option to put multiple time slots per payload. Design decision: This is considered to complicated. What if a packet gets lost? -The local crosslets cannot get lost, but remote packets may get lost. The BSN aligner will repolace -lost remote packets with filler packets that are flagged. 
The crosslets in the filler packets -contain zero data, so in the correlator they do not contribute to the visibilities. Each correlator -cell has to count the number of valid and flagged crosslets per integration interval. The number -valid crosslets N_valid can be used to weight the visibility relative to the expected number of -N_int crosslets. The number of flagged crosslet N_flagged is used for monitoring. For every -integration interval N_valid + N_flagged = N_int, by design of the BSN aligner. - +The local crosslets cannot get lost, but remote packets may get lost. For transit crosslet packets +a lost packet remains lost, because it cannot be replaced. For the subband correlator at this +node the lost remote packets can be replaced by filler data, because the BSN aligner can use the +local input as reference to detect lost packets. The BSN aligner will replace lost remote packets +with filler packets that are flagged. The crosslets in the filler packets contain zero data, so in +the correlator they do not contribute to the visibilities. Each X_pn correlator cell operates on +crosslets from another source. Therefore each X_pn correlator cell has to maintain a count of the +number of valid N_valid and of the number of flagged N_flagged crosslets per integration interval. +The N_valid can be used to weight the visibility relative to the expected number of N_int +crosslets. The N_flagged is used for monitoring. For every integration interval N_int = N_valid ++ N_flagged should be true, by design of the BSN aligner. -Time in diagrams: -- equal time for all PN in same row and in same relative column -- left to right time in time slot -- top to bottom time slots - - PN0 PN1 PN2 PN3 PN4 -t -0: L0 L1 L2 L3 L4 <-- S_pn = 12 crosslets (single pol complex subband) per packet - R4 R0 R1 R2 R3 - R3 R4 R0 R1 R2 - <-- T_sub > latency on ring -1: L0 L1 L2 L3 L4 - R4 R0 R1 R2 R3 - R3 R4 R0 R1 R2 - -2: ... 
- For every slot intergate - 00 11 22 33 44 <-- XST first LL at each PN upon L arrival - 04 10 21 32 43 <-- XST then LR at each PN upon R arrival with L in barrel - 03 14 20 31 42 <-- XST then LR at each PN upon R arrival with L in barrel - -N_int-1: <-- Dump and restart XST: - - 0 00 10 20 * * - 1 - 11 21 31 * - 2 - - 22 32 42 - 3 03 - - 33 43 - 4 04 14 - - 44 - 0 1 2 3 4 - - * is obtained via conj() - - not calculated because conj() - - - - - - - What if T_sq > T_hop latency on ring? What if T_sub > N/2 * T_hop latency on ring? @@ -644,63 +700,90 @@ A04. After that the correlator can continue with B00 and then B07 when it arrive packet input also needs a buffer, to store B7 in case the correlation of B00 is still busy. - PN0 PN1 PN2 PN3 PN4 PN5 PN6 PN7 -t -0: L0 - R7 - R6 - R5 - R4 - -0: A0 - A7 -1: B0 - A6 - B7 -2: C0 - A5 - A4 - B6 - C7 -3: D0 - B5 - B4 - C6 - D7 -4: E0 - C5 - C4 - D6 - E7 -5: F0 - D5 - D4 - E6 - F4 - E5 - E4 - - A00 - 07 - 06 <-- queue local B0 in FIFO, because first finish time slot A for A6,5,4 - 05 <-- queue remote B7 in FIFO, because first finish time slot A for A5,4 - 04 - B00 - 07 - 06 <-- queue local C0 in FIFO, because first finish time slot B for B6,5,4 - 05 <-- queue remote C7 in FIFO, because first finish time slot B for A5,4 - 04 - -If remote packets get lost then the local FIFO will run full, this can then be used to flush -the FIFOs and restart the alignment. The flush time should be long enough, such that it will -cause that all PN in the ring will restart. However it is important that all PN restart at -the same time or using the same time slot. This can be achieved by restarting at the sync -(so once per second) or by restarting at every time slot in case the previous time slot did -not receive any remote packet, which indicates that the source node was still flushing its -FIFOs. +XC ring transport: +The crosslet packets have to travel accross N/2 hops along the ring. 
At each node all remote +crosslet packets are decoded and stored-and-forwarded to be able to validate the ETH CRC, the DP +CRC and the BSN. Each node inserts its local crosslet packet onto the ring and removes the +most distant remote packet from the ring. The other remote crosslet packets are passed on. +For the remote packets that are passed on the payload can remain packed. For the local crosslet +packets the payload needs to be repacked from 2 * 16b crosslets to 64b packed data. With +N_crosslets = 1 crosslet per integration interval the P_packet = 108 octets or 14 64b words. +The packed remote and local crosslets payloads can then be multiplexed and then encoded. +The multiplexer that inserts the local crosslet packet uses a round robin scheme, so that the +local crosslet packet can be inserted as soon as there is a gap between the remote packets. +Therefore assume the multiplexer may cause an extra latency t_mux of one packet duration at +one node, but not at all nodes. The Tx repacking of the local crosslets causes gaps in the 64b +data, because r = 200/156.25 * 32/64 = 0.64 < 1. Therefore the clock domain crossing Tx FIFO +needs to use a fill FIFO. The Tx FIFO will first need to be filled by about (1-r) * P_packet = +(1 - 0.64) * 14 = 6 64b words, before the Tx can start to ensure that the packet is transmitted +without gaps. The travel latency t_travel per hop can be expressed in DP clock cycles at 200 MHz: + +- t_pipe + t_prop ~= 0.2 us, or 40 clock cycles. +- t_store ~= 14 / 156.25M = 0.09 us, or 18 clock cycles. +- t_fill ~= 6 / 156.25M = 0.038 us, or 8 clock cycles. + +Hence t_travel = 40 + 18 + 8 = 66 clock cycles. Assume t_mux is equal to t_store at one node and +0 at the N/2-1 other nodes. The total latency along the ring for the most distant remote packet +is then about N/2 * t_travel + t_mux = 16/2 * 66 + 18 = 546 clock cycles, so less than N_clk = 1024. 
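The XC ring latency estimate above can be reproduced with a small Python sketch. It is a cross-check under assumptions, not part of the design: the names are hypothetical, the values are the ones quoted in the text, and rounding up to whole 64b words and whole clock cycles is an assumption that matches the rounded numbers in the text for N_crosslets = 1.

```python
import math

# Values quoted in the text; the names are hypothetical, for illustration only.
F_DP_MHZ = 200.0          # DP clock for R_os = 1
F_MAC_MHZ = 156.25        # 10GbE MAC 64b word clock
N_CLK = 1024              # DP clock cycles per subband period T_sub
N_NODES = 16              # N nodes on the ring
S_PN = 12                 # signal inputs per PN
W_CROSSLET = 2 * 16       # N_complex * W_crosslet, bits per complex crosslet
P_OVERHEAD_OCTETS = 60    # packet header and tail overhead
T_PIPE_PROP_CLK = 40      # t_pipe + t_prop ~= 0.2 us at 200 MHz


def mac_words_to_dp_clk(words):
    """Time to move `words` 64b words at the MAC rate, in DP clock cycles (rounded up)."""
    return math.ceil(words / F_MAC_MHZ * F_DP_MHZ)


def xc_ring_latency(n_crosslets=1):
    """Return (t_travel, t_total) in DP clock cycles for N/2 hops."""
    payload_octets = n_crosslets * S_PN * W_CROSSLET // 8
    packet_words = math.ceil((P_OVERHEAD_OCTETS + payload_octets) / 8)
    r = F_DP_MHZ / F_MAC_MHZ * W_CROSSLET / 64          # 0.64 < 1
    fill_words = math.ceil((1 - r) * packet_words)      # Tx fill FIFO level
    t_store = mac_words_to_dp_clk(packet_words)
    t_fill = mac_words_to_dp_clk(fill_words)
    t_travel = T_PIPE_PROP_CLK + t_store + t_fill
    t_mux = t_store                                     # extra mux delay at one node only
    t_total = (N_NODES // 2) * t_travel + t_mux
    return t_travel, t_total


t_travel, t_total = xc_ring_latency(n_crosslets=1)
print(t_travel, t_total, t_total < N_CLK)
```

The n_crosslets parameter is included only to show where the packet size enters the estimate; the text above works out the numbers for N_crosslets = 1.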
+ +The XC dispatcher does: +- repack the remote crosslets from 64b data to 2 * 16b crosslets +- demultiplex the crosslets from the N/2 different sources +- align the local crosslets input with the N/2 remote inputs. +- output the local-local crosslets and the N/2 local-remote crosslets to the 1 + N/2 correlator + cells + +XC BSN aligner input buffer size: +The maximum input latency is less than one block period T_sub, so the size of the input buffer +in the BSN aligner only needs to be 2 blocks. While one block is free for new blocks, the other +aligned block is being output to the correlator cells. This yields an input buffer size of K = 2 +blocks, so K * S_pn * N_complex * W_subband = 2 * 12 * 2 * 16 = 768 bit, which takes one M20k +block RAM. The BSN aligner for the crosslet ring dispatcher has 1 + N/2 inputs, so it will use +9 * 1 = 9 M20k block RAMs. Note that these BSN aligner input buffers are also large enough to +fit N_crosslet = 7, because 7 * 768 = 5376 bit also fits in one M20k block RAM. + +If the BSN aligner allows direct memory access to its input buffers then the X_sq square +correlator cell can read the crosslets from the BSN aligner in arbitrary order and multiple +times. + +X_sq correlator cell: +The X_sq correlator cell has two input streams. One input stream delivers the crosslet from +S_pn = 12 signal inputs on one PN and the other input stream delivers the crosslet from +S_pn = 12 signal inputs on the same PN (for local-local visibilities) or another PN (for the +local-remote visibilities). In total the X_sq calculates X_sq = S_pn * S_pn = 12*12 = 144 +visibilities. The crosslets are delivered sequentially using a double for loop, so for each +crosslet i in range(S_pn) on one input and for each crosslet j in range(S_pn) on the other +input calculate the product and integrate the visibility. This calculation sequence requires +that crosslets can be addressed multiple times. 
For N_crosslets = 1 the X_sq correlator cell +only correlates the first S_pn = 12 crosslets that are delivered on its two inputs. For +N_crosslets > 1 the X_sq continues correlating the next S_pn = 12 crosslets that are delivered +on its two inputs. Hence N_crosslets > 1 merely adds another for loop level to the X_sq, that +loops for k in range(N_crosslets). The visibilities are calculated in order: + k, i, j + 0, 0, 0 + 0, 0, 1 + . . . + 0, 0,11 + 0, 1, 0 + 0, 1, 1 + . . . + 0, 1,11 + . . . + . . . + 0,11, 0 + 0,11, 1 + . . . + 0,11,11 + 1, 0, 0 + etc. + Support for other (shorter) integration period T_int_x? - Longer T_int as multiple of 1 s can be supported outside SDP +- Longer T_int can be supported within SDP by: + . Using BSN scheduler + . Reduces M&C data rate + . Should still fit in number of bit of visibility - Shorter T_int < 1 s (PPS): . Using BSN scheduler . increases M&C data rate @@ -715,14 +798,76 @@ How can it be scaled to more than one crosslet per XST? ******************************************************************************* * Subband offload for AARTFAAC ******************************************************************************* -Current AARTFAAC can offload S_sub_so = 36 subbands for S = 96 signal inputs (SI) in W_subband_so = 16 bit mode, -so a bandwidth of 36 * 1953125.5 Hz = 7.03 MHz. This corresponds to a load of S_sub_so * S * f_sub * N_complex * -W_subband_so = 36 * 96 * 195312.5 * 2 * 16 = 21.6 Gbps. The 8 bit subband mode does not work in RSP, but would -be sufficient for AARTFAAC. Therefore assume W_subband_so = 8 bit for LOFAR 2.0. For LOFAR 2.0 the number of LBA -doubles to S_lba = 192, so assume S = 192. The load from one 8 bit subband from all 192 signal inputs is -S * f_sub * N_complex * W_subband_so = 192 * 195312.5 * 2 * 8 = 0.6 Gbps for R_os = 1 and 0.75 Gbps for maximum -expected R_os = 1.25 of an oversampled filterbank. 
Per 10GbE output link this then yields maximum of 10G / 0.6G -= 16.6 subbands for R_os = 1 and 10G / 0.75G = 13.3 subbands for R_os = 1.25. The 10GbE requires some spare + +Assumptions for AARTFAAC2.0: + +- S = 96 signal inputs +- W_subband_so = 8 bit +- S_sub_so = 64 subbands, so 12.5 MHz subband bandwidth +- group subbands from all S = 96 inputs in a packet +- similar subband output format as in ASTRON_RP_1403_UDP_SDO ICD + +LOFAR1 uses the outer LBA for about 80 % of the time and the inner LBA for 20 % of the time. This +is because at lower frequencies the mutual coupling of LBA in the inner region becomes more +significant, which then reduces the sensitivity of the inner LBA. The mutual coupling increases +and the sensitivity decreases because for nearby LBA the wavelength >~ the distance between LBA. + +Assumptions for Station.SDP: +- Any S = 96 out of S_lba = 192 can be selected for offload +- The number of subbands per lane is independent of R_os, so set the same for R_os = 1 and R_os = 1.28. This + implies that the utilization of the lanes for R_os = 1 is about a factor 1.28 less. + +Select S = 96 from S_lba = 192 signal inputs +AARTFAAC uses the dual pol antennas, so the signal inputs (SI) have to be selected per pair of X +and Y polarization. The N = 48 antennas can be selected from the N_lba = 96 antennas in different ways, +either at the offload node or at each PN: +- Transport all SI to the offload node and select there +- Select SI per PN and only transport the selected SI to the offload node +The selection can be programmable or fixed. + + +First collect all S_lba = 192 signal inputs at the offload node, and then make an arbitrary + selection or a fixed selection. The disadvantage is that this doubles the load on the ring. +- Select at each PN and transport only First collect all S_lba = 192 signal inputs at the offload node, and then make an arbitrary + selection or a fixed selection.
+ +Use ring transport scheme 1 or scheme2a: +- With scheme 1 the selection of S out of S_lba can be made per PN, as the payload is passed along + and each node can insert none, all or a subset of its S_pn at the allocated subband index in the + payload. With scheme 2a the selection of S will be done at the offload node, so all PN then send + all their S_pn inputs via the ring. This doubles the load on the ring. +- With scheme 2a each node only has to pass on the remote packets, but at the offload node it + needs an N input BSN aligner, an N input to one output subband selection to get the offload + payload. With scheme 1 the first node initiates the offload payload and then each node has to + insert the local subbands at the correct index. This requires only a two input BSN aligner. +- If one hop fails in scheme 1 then there is no offload. If one hop fails in scheme 2a then there + is still offload from subsequent hops. + + + +Current AARTFAAC1 can offload S_sub_so = 32 subbands for S = 96 signal inputs (SI) in W_subband_so += 16 bit mode. On the RSP - Uniboard interface there are 9 subbands per lane, so S_sub_so = 36 in +total, but on the UniBoard - UDP interface to the GPU correlator only 8 subbands, so 32 in total +are output. The AARTFAAC1 output load is S_sub_so * S * f_sub * N_complex * W_subband_so = +32 * 96 * 195312.5 * 2 * 16 = 19.2 Gbps. Due to a bug in probably the RSP firmware, W_subband_so += 8 bit mode cannot be supported, but for LOFAR2.0 it can. Hence for the same output load as +AARTFAAC1, AARTFAAC2.0 can offload S_sub_so = 64 subbands, which corresponds to a bandwidth of +64 * 195312.5 Hz = 12.5 MHz. + +For LOFAR 2.0 the number of LBA doubles to S_lba = 192, but AARTFAAC2.0 assumes that still S = 96 +will offload subbands. Assume that the S = 96 signal inputs can be selected from the S_lba = 192 +available signal inputs at the Station output. 
Therefore internally in the Station SDP the +subbands from all S_lba are passed on via the ring to an output node in SDP. For the LBA the ring +in SDP connects N = S_lba / S_pn = 192 / 12 = 16 nodes, so N-1 hops. Assume all subbands are sent +in one direction along the ring. The subband data load on the last hop is then +(N-1)/N * 2 * 19.2G = 15/16 * 2 * 19.2G = 36.0 Gbps, excluding packet overhead. Given a lane +load capacity of L_lane = 7.8125 Gbps, this implies that the subband offload requires at least +ceil(36.0 / 7.8125) = ceil(4.6) = 5 lanes. + +The load from one W_subband_so = 8 bit subband is L_sub_so = S_lba * f_sub * N_complex * +W_subband_so = 192 * 195312.5 * 2 * 8 = 0.6 Gbps. Per 10GbE lane this then yields a maximum of +L_lane / L_sub_so = 7.8125G / 0.6G = 13.0 subbands for +R_os = 1 and 10G / 0.75G = 13.3 subbands for R_os = 1.25. The 10GbE requires some spare capacity, so therefore assume S_sub_so = 12 subbands / 10GbE link will just fit for R_os <= 1.25, provided that the packet overhead is < (13.3-12)/12 ~= 10 %. Hence with one 4 * 10GbE QSFP port at the final PN it is possible to offload 4 * 12 = 48 subbands or 9.375 MHz bandwidth with S_lba = 192 signal paths and W_subband_so = 8 bit. @@ -730,8 +875,6 @@ The ring can be used to transport the subbands to some single destination PN tha the 4 x 10GbE ports or 40GbE port on the QSFP. The destination PN could also do subband reordering to group subbands per S_lba = 192 inputs. -Remark: On the RSP - Uniboard interface there are 9 subbands per lane, so S_sub_so = 36 in total, but on the -UniBoard - UDP interface to the GPU correlator only 8 subbands, so 32 in total are output. The subbands are gathered at the output node via the ring. Using the ring avoids the need to use a 10GbE switch. Such a switch would need > 16 + 16 ports to support LBA + international HBA and some output ports. If the data @@ -770,7 +913,8 @@ to the rsp_terminal function on UniBoard1 for AARTFAAC.
Scheme is specific to th work if the subband data is send to the end node via a switch (or via URI like with RSP). With scheme 2a the ring could be used in both directions, but this does not improve the capacity of the -ring. With scheme 1 the packets travel 1+2+3+...+(16-1) = 120 hops. With scheme 2a the packets travel +ring. With scheme 2a in one direction the packets travel 1+2+3+...+(16-1) = 120 hops. With scheme 2a in +both directions the packets travel 1+2+3+4+5+6+7+8 = 36 hops left and 1+2+3+4+5+6+7 = 28 hops right, so total 64 hops. For the transport load on the ring as a whole scheme 2 is a factor 102/64 = 1.875 more efficient. However at the end node both schemes still have transfer the same load of 15 packets. Therefore at the end node the load for both @@ -790,7 +934,7 @@ Suppose 8 of these can be allocated to subband offload, then the ring can supppo Design decision: - Gather subbands at output node (instead of having a dedicated offload port at each node) -- Gather the subbands via the ring (to avoid the need for a 10GbE switch wit about 40 ports) +- Gather the subbands via the ring (to avoid the need for a 10GbE switch with about 40 ports) - Reorder the subbands to have all subbands from signal inputs in one payload (to ease input stage of user application) - Use scheme 2a and in both directions (to reduce the number of hops and latency) diff --git a/applications/lofar2/doc/prestudy/station2_sdp_timing.txt b/applications/lofar2/doc/prestudy/station2_sdp_timing.txt index ac488b6e9b..060a82f773 100644 --- a/applications/lofar2/doc/prestudy/station2_sdp_timing.txt +++ b/applications/lofar2/doc/prestudy/station2_sdp_timing.txt @@ -2,42 +2,43 @@ * Fixed Station BSN grid and the PPS grid ******************************************************************************* -The Station needs an external trigger to align all ADCs in the RCU2S and all FPGA procesing nodes in the SDP. -For this trigger a pulse from the pulse per second (PPS) is used. 
The PPS is aliged to the top of second of the -UTC time of day (ToD). The PPS is a hardware trigger that is available within the entire SDP at sample clock -cycle accuracy. Thanks to the Timing Distributor (TD) the PPS trigger is also available as hardware trigger in -all Stations. Thanks to the TD the PPS is aligned to UTC ToD, and the ToD is available to the Telescope Manager -(TM) in LOFAR2.0 and to Station Control in each Station. The TM controls, via Station Control, which PPS pulse -is used to start SDP. The PPS is identified by a Seconds Sequence Number (SSN) that counts PPS since a certain +The Station needs an external trigger to align all ADCs in the RCU2S and all FPGA procesing nodes +in the SDP. For this trigger a pulse from the pulse per second (PPS) is used. The PPS is aliged to +the top of second of the UTC time of day (ToD). The PPS is a hardware trigger that is available +within the entire SDP at sample clock cycle accuracy. Thanks to the Timing Distributor (TD) the +PPS trigger is also available as hardware trigger in all Stations. Thanks to the TD the PPS is +aligned to UTC ToD, and the ToD is available to the Telescope Manager (TM) in LOFAR2.0 and to +Station Control in each Station. The TM controls, via Station Control, which PPS pulse is used to +start SDP. The PPS is identified by a Seconds Sequence Number (SSN) that counts PPS since a certain date in the past, e.g. t_epoch = 1 jan 1970, but some other fixed date is possible too. -The SDP processes the data in blocks of ADC samples that are identified by a Station block sequence number -(BSN). The Station BSN time grid should be fixed, so independent of when the data processing starts. Therefore -the Station BSN counts blocks since the same t_epoch as the SSN, so the t_epoch defines the common reference -moment in history for the Station BSN grid and for the PPS grid. The PPS grid does not necessarily always -coincide with the Station BSN grid. 
The BSN period determines whether the Station BSN can start exactly at an -PPS or not. +The SDP processes the data in blocks of ADC samples that are identified by a Station block sequence +number (BSN). The Station BSN time grid should be fixed, so independent of when the data processing +starts. Therefore the Station BSN counts blocks since the same t_epoch as the SSN, so the t_epoch +defines the common reference moment in history for the Station BSN grid and for the PPS grid. The +PPS grid does not necessarily always coincide with the Station BSN grid. The BSN period determines +whether the Station BSN can start exactly at an PPS or not. The processing of the ADC inputs in SDP is done by multiple FPGAs in parallel. Each FPGA has a BSN source that creates the Station BSN grid. The BSN source is the wall clock of the FPGA. To be able to start the data processing at any PPS it is necessary that the BSN source can start at a programmable fraction of a BSN period after the PPS. In this way processing of one Station ADC signal input can be restarted at any PPS (with zero -phase offset to the other signal input) and an entire Station can be restarted at any PPS (with zero phase +phase offset to the other signal input) and an entire Station can be restarted at any PPS (with zero phase offset to the other Stations). The BSN source ensures that the BSN timing is always on the fixed Station BSN time grid. The initial BSN and the offset fraction of a BSN period need to be provided to SDP via the M&C interface by Station Control. Both Station Control and SDP know the PPS grid. Station Control also knows UTC and with that information Station Control can program and initialize the BSN source in the FPGAs to start -counting data blocks at the next PPS. +counting data blocks at the next PPS. The sample frequency f_adc = 200 MHz is an integer number of Hz and locked to the PPS, therefore the PPS grid always coincides with the ADC sample period T_adc = 1/f_adc = 5 ns grid. 
The Station BSN block period is an integer number of N_blk sample periods T_adc. The Station BSN period is set by the subband rate of the subband -polyphase filterbank (PFB), so the Station BSN period is equal to the subband period is T_sub. -The input to the subband filterbank is the real signal from the ADC. For both the critically sampled PFB and +polyphase filterbank (PFB), so the Station BSN period is equal to the subband period is T_sub. +The input to the subband filterbank is the real signal from the ADC. For both the critically sampled PFB and the oversampled PFB the data block size of the input data is N_blk = N_FFT = 1024 ADC samples, however for the oversampled PFB the blocks overlap by a factor R_os. Hence for the critically sampled PFB the BSN period is T_sub = N_blk = 1024 [T_adc] and for the oversampled PFB the BSN period is T_sub = N_blk / R_os = 864 -[T_adc], in case R_os = 32/27. At the output of the subband filterbank a data block contains +[T_adc], in case R_os = 32/27. At the output of the subband filterbank a data block contains N_sub = N_FFT / N_complex = 512 complex subband samples that all correspond to the same time instant as defined by the Station BSN. Each subband sample represents another frequency. @@ -50,17 +51,17 @@ of 0:N_blk-1 sample periods. The BSN source starts at a PPS with an initial Stat and a BSN offset fraction of: BSN offset fraction = mod(SSN * 1 s, T_sub) / T_adc = mod(SSN * f_adc, N_blk) - + to make sure that the BSN grid is always relative to t_epoch, independent of at which PPS the BSN source was started. The Station BSN increments after every block. The time ToD_BSN at the BSN grid is: ToD_BSN = t_epoch + Station BSN * T_sub - + Note: - The BSN offset fraction could also be compensated for by delaying the sample data in the ADC signal input buffers at the input of SDP. Delaying the data does compensate for phase differences in the subband - data, but does not compensate for the offset in the BSN grid. 
The BSN alignment buffers between signal + data, but does not compensate for the offset in the BSN grid. The BSN alignment buffers between signal inputs from different FPGAs then still need to compensate for this BSN offset fraction. Hence delaying the data is an indirect and incomplete solution, and therefore it is not used. - In LOFAR1 the Station BSN is divided in a 32 bit seconds sequence number (SSN) that counts PPS intervals and @@ -80,7 +81,7 @@ and any M&C upon the data, because: - It is not necessary to facilitate using an offset 0 < T_sub_o < T_sub to start the BSN grid at an integer number of T_adc after t_epoch, because the BSN grid is sufficiently fine. -- It is not necessary to represent fine group delays of digital filters or analogue electronics and +- It is not necessary to represent fine group delays of digital filters or analogue electronics and cables in the BSN, because these delays are all accounted for after calibration. . Course group delays and cable delay differences can be compensated for in steps to T_adc via the signal input buffer of every ADC input in SDP. @@ -106,7 +107,7 @@ In LOFAR2 the timestamp should be independent of: - using 200 MHz sample rate or 160 MHz sample rate, - using critically sampled subband filterbank or oversampled subband filterbank - + If T_sub was fixed then T_sub could be used as timestamp resolution (like in APERTIF). However T_sub depends on the type of subband filterbank with a resolution of T_adc. If T_adc was fixed then T_adc could be used as timestamp resolution. However T_adc depends on the sample clock rate. Therefore the timestamp resolution @@ -118,11 +119,11 @@ of 0.2 ns such that they are: * integer values, and * independent of the sample period. - + The actual timestamp in fractional seconds of 0.2 ns follows from: timestamp = Station BSN * T_sub_i * 0.2 [ns]. 
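
The timestamp relation above can be illustrated with exact integer arithmetic. This is a minimal
sketch, assuming that T_sub_i is the subband period T_sub expressed in integer units of 0.2 ns and
that N_blk = 1024 samples per block before oversampling. The function names t_sub_i and timestamp_ns
are hypothetical and only serve this example.

```python
from fractions import Fraction

UNIT_NS = Fraction(1, 5)  # timestamp resolution of 0.2 ns


def t_sub_i(n_blk, f_adc_hz, r_os=Fraction(1)):
    """Return T_sub in integer units of 0.2 ns (hypothetical helper).

    n_blk    : number of ADC samples per block before oversampling
    f_adc_hz : ADC sample rate in Hz
    r_os     : oversampling ratio as an exact fraction, e.g. Fraction(32, 27)
    """
    t_adc_ns = Fraction(10**9, f_adc_hz)      # sample period in ns
    t_sub_ns = n_blk * t_adc_ns / r_os        # subband (block) period in ns
    units = t_sub_ns / UNIT_NS
    if units.denominator != 1:
        raise ValueError("T_sub is not an integer number of 0.2 ns units")
    return int(units)


def timestamp_ns(bsn, t_sub_units):
    """timestamp = Station BSN * T_sub_i * 0.2 [ns], as an exact fraction."""
    return bsn * t_sub_units * UNIT_NS


if __name__ == "__main__":
    for f_adc in (200_000_000, 160_000_000):
        for r_os in (Fraction(1), Fraction(32, 27), Fraction(32, 25)):
            print(f_adc, r_os, t_sub_i(1024, f_adc, r_os))
```

For these example sample rates and oversampling ratios T_sub_i comes out as an integer, so the BSN
and T_sub_i together give the timestamp without rounding.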
- + The BSN and T_sub_i can be specified as: - single 64 bit integer timestamp value of BSN * T_sub_i [0.2 ns] @@ -131,7 +132,7 @@ The BSN and T_sub_i can be specified as: To cover 116 years for a BSN with smallest T_sub = 4000 ns for R_os = 32/25 = 1.28 requires: log2( 116 * (365.25 * 24 * 3600 / 4000e-9) ) = 49.7, so 50 bits - + Therefore allocate 64b in a packet header to send the BSN information. The BSN and timestamp are direcly related via T_sub_i, but the advantage of providing the BSN separately is that it increments by 1 for each block period T_sub, so it can be used as block index. @@ -155,12 +156,12 @@ of the data in a Station. However counting blocks is not sufficient to maintain The assumptions are: - data is transported and processed in blocks, -- partial blocks cannot occur. +- partial blocks cannot occur. - the data flow can only stop or continue at block boundaries. To recover from gaps in the data flow the BSN can be transported along with every data block. For the external FPGA interfaces one or more data blocks get packed into the payload and the BSN is then -transported via the header. The BSN in the header corresponds to the first data block in the payload, the +transported via the header. The BSN in the header corresponds to the first data block in the payload, the position of a data block in the payload defines the offset to this BSN. For data transport within the FPGA it is costly from a resource point of view to tranport the 64 bit BSN @@ -191,10 +192,10 @@ For the blocks between sync pulses the Station BSN is incremented with every blo BSN needs to be preserved during the sync interval, then lost or discarded blocks must be replaced by filler blocks. Whether only the BSN at the data sync is relevant, or whether also the BSN of subsequent data blocks is needed depends on the function. For the statistics (AST, SST, BST, XST) the BSN at the data sync is sufficient to -mark the timing of integration results. 
For these integration results the number of data blocks within the +mark the timing of integration results. For these integration results the number of data blocks within the integration interval is relevant to know how many blocks contributed (and thus also how many blocks were lost). However for the integration result it is not relevant which blocks got lost, because the statistics do not have -to keep accurated time centroid information. It is sufficient to use the BSN at data sync to timestamp the +to keep accurated time centroid information. It is sufficient to use the BSN at data sync to timestamp the integration results, as if all blocks contributed. As another example, for the beamformer it is important to be able to recreate the BSN at the data sync and all subsequent data blocks, because the beamformer must weight and sum the input beamlets that coincide in time. Similar for the beamformer output to CEP and for the subband @@ -253,7 +254,7 @@ because it means that Stations should start at an even sync interval when the PP to ensure that all stations remain aligned. Starting only at even PPS ensures that LOFAR1 uses a BSN grid that is fixed to t_epoch = 1970. For an oversampled subband filterbank the PPS grid and BSN grid coincide every q-th PPS, where R_os = p/q, so then a Station should only start every q-th PPS. -In APERTIF the sync interval was chosen to be an integer number of fine channel periods T_chan = +In APERTIF the sync interval was chosen to be an integer number of fine channel periods T_chan = N_Chan * T_sub, which resulted in 12500 T_chan and 800000 T_sub or a period of 1.024 s. This 1.024 s is used as unit integration period of the correlator in APERTIF. A sync interval of 1 s would have resulted in 781250 T_sub and 12207.03125 T_chan. The APERTIF sync interval of 1.024 s is akward too, because it differs @@ -262,7 +263,7 @@ grid coincide, which is once every 125 s, because 128/125 = 1.024. 
LOFAR1 and APERTIF show that in general application periods do not integer fit with the 1 s PPS grid. For integration periods the only two options are to either use another integration interval (like 1.024 s in APERTIF) or to accept that the number of samples per integration interval can differs by one (like 195313 or -195312 in LOFAR1). +195312 in LOFAR1). Both LOFAR1 and APERTIF cannot start at any PPS without affecting the BSN grid. This needs to be solved for LOFAR2.0. Like in LOFAR1, for LOFAR2.0 the PPS grid and BSN grid are fixed to t_epoch = 1970. However, instead of waiting until the BSN grid and PPS grid coincide, the BSN source in @@ -339,14 +340,14 @@ In Station SDP the BSN serves two purposes; - the entire BSN provides wall clock time on the BSN grid and is thus linked to UTC, - the difference in BSN is used to time align input streams -In Station SDP each FPGA has a local BSN source and a local stream that carries the data from its local ADC +In Station SDP each FPGA has a local BSN source and a local stream that carries the data from its local ADC signal inputs. The BSN aligner needs to align the local stream with the remote streams that are received from the other FPGA via the ring. The maximum BSN latency on the ring depends on the number of FPGAs in the ring. Suppose each FPGA introduces a latency of at least one packet, because it applies store and forward on packets, and less than two packets. Furthermroe assume that on the ring each packet contains one data block. The maximum number of hops between the first FPGA on the ring and the final FPGA is N_FPGA-1. For the LBA ring N_FPGA = 16. Hence the maximum BSN latency that can occur within -SDP is < (N_FPGA-1) * 2 < 32 block periods. Hence the maximum BSN difference between the local input and a +SDP is < (N_FPGA-1) * 2 < 32 block periods. Hence the maximum BSN difference between the local input and a remote input of the BSN aligner is < 32. 
Therefore to align the input streams the BSN aligner only has to compare the log2(32) = 5 LSbits of the BSN of all input streams. This implies that for the BSN aligner it would be sufficient to only transport these bits in the packet header, however it is convenient and not too @@ -371,6 +372,28 @@ Note: +Key ideas: +- Use Ethernet CRC and DP CRC to ensure detection of packet errors and to ensure error free blocks + within FPGA firmware +- Within SDP firmware the BSN at sync can be obtained from the local BSN source and subsequent + BSN can be derived by counting blocks: + . Use filler blocks to replace lost packets, to maintain BSN count within FPGA firmware + . Use local BSN source in FPGA and pass on sync within SDP firmware to know the BSN in the firmware. + + + RCU2 Subband Ring + PFB + + data data + data ------> BSN --------> Move, -------> Packet + PPSH ------> source sync DSP sync encoding + BSN .........> BSN ring + + Ring BF, XC + data data data + Packet --------> Validate --> Validate --> BSN --------> Move, --------> Packet + decoding sync CRC BSN aligner sync DSP sync encoding + ring BSN .......................................................> BSN output @@ -386,9 +409,7 @@ Design decisions: but in simulation it can be much less. - Use central UTC timestamp at PPS initialized by M&C and incremented by SDP firmware for the SSN per FPGA. -- Use 32 bit SSN to fit UTC in seconds for 136 years since 1970 -- Use local BSN that counts data blocks within a sync interval, so it restarts at 0 at the internal sync -- Within SDP transport the sync and the local BSN. The sync is transported via the MSbit of the local BSN. - At the sync transport the 31 bit SSN instead of local BSN 0, but only for monitoring purposes. -- Derive 64 bit UTC timestamp in units of T_sub in SDP firmware and use this for data output to CEP +- Use 64 bit continuous BSN that counts subband periods since 1970 +- Within SDP transport the sync and the BSN. 
The sync is transported via the MSbit of the BSN. + -- GitLab