Added txt documents with the prestudy for the detailed design of the SDP firmware.

Commit d78945af authored 5 years ago by Eric Kooistra
--- a/applications/lofar2/doc/lofar_station_firmware_model.vsd
+++ b/applications/lofar2/doc/lofar_station_firmware_model.vsd
--- a/applications/lofar2/doc/prestudy/desp_hdl_design_article.txt
+++ b/applications/lofar2/doc/prestudy/desp_hdl_design_article.txt
+. Distinguish between state registers and pipeline registers.
+  The state registers keep the state of the function and the function itself is programmed in combinatorial logic.
+  In this way the pipelining that is needed to achieve timing closure can be added independent of the function.
+  This approach could be described in a paper, because it is quite significant and differs from the well known
+  Gaisler approach (that uses RL=1 and does not separate state from pipeline). AXI uses RL=0, but it still
+  needs to be checked how it then handles pipelining.
+. Components need pipelining to achieve timing closure. This pipelining causes a latency in the data
+  stream. This latency is typically no problem, because it only delays the output. If components need
+  flow control then the stream has a siso backpressure signal that must have a certain timing relation
+  to the sosi data signal. This timing relation is the ready latency (RL) and the RL can be >= 0. For 
+  RL = 0 the ready signal acts as a data acknowledge and for RL > 0 the ready signal acts as a data
+  request signal. Adding pipelining to the sosi data increases the RL.
+. The RL is explained in the Avalon specification. Examples of RL = 0 are the so called look ahead (Altera)
+  or first word fall through (Xilinx) FIFOs. In our UniBoard applications we use RL = 1. For most parts
+  of the design we try to not use flow control. I think that AXI stream uses RL = 0.
+. The function operates with ready latency (RL) = 0, if it is combinatorial. If the stream has no flow
+  control then the pipeline is achieved as an output register stage. If the stream does need flow control,
+  then this output register stage increases the RL by 1. To restore the RL to 0 a dp_latency_adapter.vhd
+  is needed. This latency adapter also registers the ready, so it provides pipelining for both the output
+  stream sosi data as well as the output stream siso ready flow control.
+. For new components the development approach is to implement the function for RL=0, so only with the state
+  registers. If the component does not use flow control, then it may still just wire the flow control
+  from output to input. If the component does use flow control then it can combinatorially impose this
+  on the incoming flow control and pass the combined flow control on to its input. For timing closure
+  the pipelining is added as a separate stage. Either pipeline sosi if no flow control is needed
+  or pipeline siso if flow control is needed. For example: dp_block_resize.vhd, dp_counter.vhd.
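The ready latency described above can be illustrated with a small cycle-based model. This is a minimal sketch and not taken from the firmware: the function names and the source that always has data available are assumptions for illustration. It uses the Avalon streaming convention that a ready in cycle t permits a transfer in cycle t + RL.

```python
def transfer_cycles(ready, rl):
    """Return the clock cycles in which a transfer may occur.

    ready : list of 0/1 values of the siso ready signal, one per clock cycle
    rl    : ready latency, a ready in cycle t permits valid data in cycle t + rl
    Assumes (for illustration) a source that always has data available.
    """
    return [t + rl for t, rdy in enumerate(ready) if rdy and t + rl < len(ready)]


def add_output_register(cycles, n_cycles):
    """Model one pipeline register on the sosi data: each transfer appears one cycle later."""
    return [t + 1 for t in cycles if t + 1 < n_cycles]


ready = [1, 1, 0, 0, 1, 0, 1, 1]

rl0 = transfer_cycles(ready, rl=0)   # ready acts as acknowledge in the same cycle
rl1 = transfer_cycles(ready, rl=1)   # ready acts as request for the next cycle

# Registering the data of an RL = 0 stream gives the timing of an RL = 1 stream.
registered = add_output_register(rl0, len(ready))
print("RL=0 transfers:", rl0)
print("RL=1 transfers:", rl1)
print("RL=0 plus output register:", registered)
```

In this model the registered RL = 0 stream has the same transfer cycles as the RL = 1 stream, which is why a latency adapter is needed to get back to RL = 0.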
+Ref:
+ tools/oneclick/doc/desp_firmware_dag_erko.txt : Section 6) VHDL design
+ tools/oneclick/doc/desp_firmware_overview.txt
\ No newline at end of file
--- a/applications/lofar2/doc/prestudy/dupllo_aad_erko.txt
+++ b/applications/lofar2/doc/prestudy/dupllo_aad_erko.txt
+1) Introduction
+a) Focus on checking UniBoard2 based solution, because:
+- we need an FPGA to interface the ADC
+- using an existing board saves development time.
+b) Assumptions:
+- the subband filterbank will be implemented on the FPGA, because we need the FPGA anyway so then it can
+  also do some DSP
+- the station beamformer is implemented on the FPGA because for a LOFAR station the number of beams is
+  small, so this yields a larger data rate reduction to the subsequent processing (on CPU / GPU)
+- use a ring beamformer (like in Lofar 1.0), because it avoids having to use a large mesh (like in Apertif
+  BF) and because the ring can easily be extended with more nodes if necessary (like for the international
+  Lofar 1.0 stations).
+- Starting with a critically sampled filterbank saves time, in design allow for oversampled filterbank
+c) New compared to Lofar 1.0
+- Same analogue band width and sample frequency ranges, but in total 4x more RCU input:
+  . 3 times more input due to simultaneous 2x LBA + 1x HBA
+  . ready for 4 times more input to support another 1x HBA input for Lofar Space Weather
+- Ready for output to Aartfaac 2.0
+d) Other relevant aspects
+- System requirements must be clear and complete at PDR, otherwise the project will delay due to lack of clarity
+- We used to have delay and therefore 'conflict' at end of project due to reactive, passive planning, now
+  we need 'conflict' at start of project to be proactive and meet the end date.
+- X and Y and LBA and HBA can be implemented on independent hardware because they are processed
+  independently. The firmware can be the same (apart from different parameter settings). Therefore
+  best use separate RCU for LBA and HBA, instead of combining LBA + HBA on one RCU, because otherwise they
+  will also need to be processed together in a subrack (assuming that the serial ADC link goes via a
+  backplane and not via a fiber.)
+- the current HBA uses DC power via one coax (x-pol) and control via the other coax (y-pol). The control
+  uses a proprietary control protocol based on Manchester encoding and implemented using a PIC micro
+  controller. The PIC microcontroller is an I2C slave.
+- all input should also be available at output of the FPGA to be future proof, this is a lesson learned
+  from Lofar 1.0 where only with high effort still only a small band could be made available for Aartfaac
+- during a life time of 10 years FPGAs remain available, GPU will require an upgrade to a new version
+e) Development time:
+- Starting with a critically sampled filterbank saves time, in design allow for oversampled filterbank
+- With a critically sampled filterbank like in Lofar 1.0 the new Lofar 2.0 station can operate together with
+  a Lofar 1.0 station
+- Much reuse from Apertif and RSP firmware
+- Some new aspects: 
+   . oversampled filterbank (oversample factor increases the output load)
+   . JESD serial ADC data interface
+   . how to connect RCU I2C control interface (via microprocessor with 1GbE on PAC)
+   . TBB function on UniBoard2
+   . separate MM clock domain and sample clock domains (160, 200 MHz)
+   . reuse M&C protocol from Gemini instead of UniBoard Control Protocol
+   . station correlator via TBB function or via crosslet statistics (similar to RSP, Apertif PAF
+     correlator)
+- Detailed design must include M&C and test
+2) Oversampled filterbank:
+- See dupllo_oversampled_subband_filterbank.txt.
+3) TBB memory
+a) Lofar 1.0 (96 RCU for core and remote, 192 for international)
+There is 1 TBB / 2 RSP, so 1 TBB / 16 RCU --> so 32 GByte/ 16 RCU = 2 GByte / RCU.
+With 200 MHz and assume 2 byte per sample this corresponds to 2 GByte / 0.2 GHz / 2 byte = 5 sec.
+b) 6 UniBoard1 (288 RCU)
+The largest DDR3 SODIMM that can fit on UniBoard is 16 GByte and each PN on UniBoard can have two DDR3
+SODIMMs. With 6 UniBoard1 there are 6 * 8 * 2 * 16 GByte = 1536 Gbyte for 288 RCU = 5.3 GByte / RCU.
+With 200 MHz and assume 2 byte per sample this corresponds to 5.3 GByte / 0.2 GHz / 2 byte = 13.3 sec.
+Uses 6 * 8 * 2 = 96 DDR3 SODIMMs. DDR3 can achieve 1.6 GTps @ 200 MHz.
+c) 3 or 4 UniBoard2 (288 RCU for 2x LBA + HBA, 384 including also 1x HBA for Space Weather)
+The largest DDR4 SODIMM that can fit on UniBoard2 is 36 GByte and each PN2 can have two DDR4 SODIMMs.
+With 3 UniBoard2 there are 3 * 4 = 12 PN, so in total 12 * 2 * 36 Gbyte = 864 Gbyte for 288 RCU = 3
+GByte / RCU. With 200 MHz and assume 2 byte per sample this corresponds to 3 GByte / 0.2 GHz / 2 byte
+= 7.5 sec. Uses 12 * 2 = 24 DDR4 SODIMMs. The required write rate per SODIMM is 288 RCU * 16b * 200 MHz
+/ 24 SODIMMs = 38.4 Gbps. The data width of the SODIMM is 64b (or 72b) so this is 38.4 Gbps / 64b =
+0.6 GTps, which is easily feasible, because DDR4 can achieve 3.2 GTps @ 400 MHz (transfers per second). 
+==> 1 UniBoard2 per 96 RCU can buffer 1.5 times as much transient data as Lofar 1.0. Possibly use a factor 2
+    fewer DDR4 modules or use smaller DDR4 modules.
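The buffer depths and the DDR4 write rate in a) to c) follow from the same few divisions, so they can be recomputed in one place. This is a minimal sketch with hypothetical function and variable names; the numbers are the ones quoted above (200 MHz sampling, 2 byte per sample, 64 bit SODIMM data width).

```python
def buffer_seconds(total_gbyte, n_rcu, f_adc_ghz=0.2, bytes_per_sample=2):
    """Seconds of raw ADC data that fit in memory, per RCU input."""
    gbyte_per_rcu = total_gbyte / n_rcu
    return gbyte_per_rcu / (f_adc_ghz * bytes_per_sample)


lofar1 = buffer_seconds(32, 16)               # one TBB with 32 GByte per 16 RCU
unb1 = buffer_seconds(6 * 8 * 2 * 16, 288)    # 6 UniBoard1, 8 PN, 2 SODIMMs of 16 GByte
unb2 = buffer_seconds(3 * 4 * 2 * 36, 288)    # 3 UniBoard2, 4 PN2, 2 SODIMMs of 36 GByte

# Required DDR4 write rate per SODIMM for the UniBoard2 case
n_sodimm = 3 * 4 * 2
write_gbps = 288 * 16 * 0.2 / n_sodimm        # 16 bit samples at 200 MHz
write_gtps = write_gbps / 64                  # 64 bit wide SODIMM data bus

for name, seconds in (("Lofar 1.0", lofar1), ("UniBoard1", unb1), ("UniBoard2", unb2)):
    print(f"{name}: {seconds:.1f} s")
print(f"DDR4 write rate: {write_gbps:.1f} Gbps = {write_gtps:.2f} GTps per SODIMM")
```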
+d) UniBoard2 + external TBB storage cluster:
+Perhaps the TBB function can be implemented on an external storage cluster, because UniBoard2 can output
+all input. The total data rate is 288 * 200M * 16b = 288 * 3.2 Gbps = 921 Gbps, so with 2 ADC / 10GbE
+link this requires 144 10GbE links. For 10 s the cluster needs about 1152 GByte = 72 * 16 Gbyte DDR
+modules.
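The sizing of the external storage cluster can be checked with integer arithmetic. This is a sketch under the assumptions stated in d): 288 ADC inputs, 16 bit samples at 200 MHz, 2 ADC streams per 10GbE link and 16 GByte memory modules. The variable names are illustrative.

```python
N_ADC = 288
SAMPLE_RATE_HZ = 200_000_000
BITS_PER_SAMPLE = 16
ADC_PER_LINK = 2               # 2 * 3.2 Gbps fits in one 10GbE link
BUFFER_SECONDS = 10
MODULE_BYTES = 16 * 10**9      # 16 GByte DDR module


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


total_bps = N_ADC * SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # aggregate rate in bit/s
n_links = ceil_div(N_ADC, ADC_PER_LINK)                # number of 10GbE links
total_bytes = total_bps * BUFFER_SECONDS // 8          # memory for the buffer depth
n_modules = ceil_div(total_bytes, MODULE_BYTES)

print(f"total rate : {total_bps / 1e9:.1f} Gbps over {n_links} links")
print(f"memory     : {total_bytes // 10**9} GByte = {n_modules} modules of 16 GByte")
```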
+4) FPGA resource usage
+See Station ADD section 4.5.2.9
+[1] "HBA Control Design Description", LOFAR-ASTRON-MEM-175, apr 2010, E. Kooistra
+[2] "RSP Firmware Design Description", LOFAR-ASTRON-SDD-018, sep 2013, E. Kooistra
\ No newline at end of file
--- a/applications/lofar2/doc/prestudy/dupllo_oversampled_subband_filterbank.txt
+++ b/applications/lofar2/doc/prestudy/dupllo_oversampled_subband_filterbank.txt
+Oversampled filterbank:
+1) Purpose
+- to measure line spectra in the channels at the edges of a subband, could the AAF for Apertif be an alternative?
+- to use a synthesis filterbank on the beamformed data, why reconstruct the time series ?
+2) Working of analysis oversampled filterbank
+  PFB = PFIR -> FFT
+The polyphase filterbank (PFB) consists of a FIR prefilter (PFIR) and an FFT. The downsample factor is set by the FFT block size N_fft. For computational efficiency N_fft needs to be a power of 2, but a factor 3 or 5 may be included too. The PFIR section has N_fft phases and N_tap taps per phase. The coefficients follow from a low pass prototype FIR filter, as a snake pattern for all taps, for all points. In a critically sampled PFB the input data is shifted in in blocks of size N_fft. In an oversampled PFB the data is shifted in in blocks of size M with M < N_fft, so r = N_fft/M is the oversample factor. A shift of less than N_fft causes a phase step between blocks in the PFB output. This phase step can be compensated by counter rotating the data that inputs into the FFT [harris, tuthil].
+The oversampling N_fft / M also implies that multiple PFBs in parallel need to keep aligned not only the N_fft blocks, but also the oversampling sub blocks of size M. In ASKAP r = 32/27 with 1 MHz subbands means that an integer number of fine channel periods takes 27 seconds, which causes a periodicity at large time scales when aligning to the human (and VDIF) 1 sec grid.
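The phase step between blocks can be made visible with a single DFT bin and a complex test tone. This is a simplified sketch and not the firmware method: it leaves out the PFIR, uses small illustrative values (N_fft = 8, M = 6) and applies the correction as a phase factor on the FFT output, which by the DFT shift theorem corresponds to a circular shift of the FFT input.

```python
import cmath
import math

N_FFT = 8      # FFT size (small, for illustration)
M = 6          # block shift, so r = N_FFT / M = 4/3
K_BIN = 1      # subband (FFT bin) that contains the test tone


def dft_bin(block, k):
    """Single DFT bin k of one input block."""
    n_fft = len(block)
    return sum(x * cmath.exp(-2j * math.pi * k * n / n_fft) for n, x in enumerate(block))


# Complex tone exactly at the centre of bin K_BIN
tone = [cmath.exp(2j * math.pi * K_BIN * n / N_FFT) for n in range(N_FFT + 4 * M)]

raw, compensated = [], []
for m in range(4):
    block = tone[m * M : m * M + N_FFT]      # window advances by M < N_FFT samples
    y = dft_bin(block, K_BIN)
    raw.append(y)
    # Counter rotate by the phase that a shift of m * M samples introduces in bin K_BIN
    compensated.append(y * cmath.exp(-2j * math.pi * K_BIN * m * M / N_FFT))

# Phase step between consecutive blocks, equal to 2*pi*K_BIN*M/N_FFT modulo 2*pi
phase_step = cmath.phase(raw[1] / raw[0])
print("phase step per block [rad]:", round(phase_step, 6))
print("compensated magnitudes    :", [round(abs(c), 6) for c in compensated])
```

Without compensation the tone in bin k rotates by 2*pi*k*M/N_fft per block; after counter rotation every block yields the same complex value.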
+    0                    f_s/2
+  |-.-|---|..............|---|
+    .
+  |-.-| f_sub/2
+    .
+  <-.-> N_chan
+    .  
+ |--.--| f'_sub/2
+    . 
+ <--.--> N'_chan
+For the critically sampled PFB the downsampled frequency per subband is f_sub = f_s / N_fft. In case of a real input there are N_sub = N_fft / 2 subbands, where the factor 2 is because for a real input the positive and negative frequency spectra are complex conjugates, so only half of the subbands are unique.
+In the PFB each downsampled subband is centred around 0 Hz with subband sample frequency f_sub and complex subband samples. For a complex signal the Nyquist sample rate is equal to the bandwidth, so the Nyquist factor 2 then appears in the fact that the signal is complex, so with 2 values (real and imaginary) per sample.
+The subband bandwidth B_sub is determined by the PFIR and independent of the subband rate f_sub, so B_sub <= f_sub. The f_sub = f_s / N_fft defines the frequency grid. The f'_sub > f_sub makes it possible to oversample B_sub and to have B_sub = f_sub without aliasing. For the oversampled filterbank the f'_sub = r * f_sub. The subband bandwidth B_sub can be selected such that it is still almost flat up to f_sub and then drops down to the stop band level at f'_sub. The width of the transition region is set by r. ASKAP and SKA LFAA use r = 32/27 ~= 1.185. For two neighbour subbands the transition region to attenuate the aliasing is 2*(r-1)*f_sub. A larger oversampling factor r eases the PFIR filter for a required aliasing attenuation, but increases the data rate. 
+Oversampling does not change the frequency grid of the PFB, because the frequency grid is set by the FFT size. The oversampling only increases the sample rate per frequency bin (subband or channel) and this can be used to achieve more attenuation between neighbouring bins (subband or channel) to eliminate aliasing.
+   ----    ---- ^
+       \  /     .
+        \/      .
+        /\      .
+       /  \     .
+      /    \    v 
+      <->       aliasing attenuation
+        f'_sub 
+      f_sub
+The subbands (coarse channels) are again separated into smaller bandwidth channels (fine channels). The number of channels in f'_sub is N'_chan, so f'_chan = f'_sub / N'_chan. If f_sub = K * f'_chan then K * N_sub channels from the oversampled subbands provide a continuous flat spectrum, without aliasing between subbands. The N'_chan - K channels in transition regions are dropped. The FFT size of the channel PFB is equal to the number of channels N'_chan, because the channel PFB has complex subband input.
+Define r = p/q = N_fft/M where p and q are the smallest integers to represent r. 
+ f_sub = f'_sub/r = N'_chan * f'_chan / r = K * f'_chan
+ --> K = N'_chan / r = N'_chan * q / p
+Hence to fit the integer constraint for K both N_fft and N'_chan must be integer divisible by p. The q is free to choose, but must be integer and <= p.
+Beamforming is done per subband sample from S_ant inputs. The result is a beamlet, which can be regarded as a subband with direction. A subband may be used for multiple beam directions, so it results in a beamlet for each direction. For the subband and beamlet samples the data rate is a factor r higher, it is only after a channel PFB that the channels in the transition band can be dropped.
+3) Compatibility with LOFAR 1.0
+In LOFAR 1.0 the subband PFB F_sub has N_fft = 1024, so N_sub = 512. The channel PFB F_chan has N_chan = 16, 64 or 256 channels. The 16 channels mode is used for pulsar timing (PST). In LOFAR 1.0 both F_sub and F_chan are critically sampled. Using r = p / q = 32 / 27 for LOFAR 1.0 with 64 channels fits and yields a spectrum with 54 channels per f_sub, so the channel width then increases by the oversample factor.
+To achieve the same width as for LOFAR 1.0 requires using r = 2 and N'_chan = 128, because r = p/q = 2/1 then yields N_chan = 64 channels per f_sub. Compared to a LOFAR 1.0 channel the phase slope over the channels from an oversampled F_sub will be a factor r less, because f'_sub = r * f_sub.
+I do not think it is possible to support LOFAR 1.0 channel width with an oversampled F_sub for r < 2. Also not with an oversampled channel PFB, because oversampling does not change the channel frequency grid. Using r = 2 does fit the existing LOFAR 1.0 frequency grid, but will cause a factor r = 2 higher output rate to CEP, because the data rate can only be reduced again after the channel filter. Therefore a solution can be to move the fine channel filter from CEP to the stations. 
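The channel counts quoted in this section follow from K = N'_chan * q / p. The relation can be checked with exact fractions; this is a minimal sketch and the function name is hypothetical.

```python
from fractions import Fraction


def channels_per_subband(n_chan_os, r):
    """Return K = N'_chan / r, or None when K is not an integer.

    n_chan_os : number of fine channels N'_chan in the oversampled subband
    r         : oversample factor p/q as a Fraction
    """
    k = Fraction(n_chan_os) / Fraction(r)
    return int(k) if k.denominator == 1 else None


k_askap = channels_per_subband(64, Fraction(32, 27))    # r = 32/27 with 64 channels
k_lofar1 = channels_per_subband(128, Fraction(2, 1))    # r = 2 with 128 channels
k_no_fit = channels_per_subband(64, Fraction(5, 4))     # 64 is not divisible by p = 5

print(k_askap, k_lofar1, k_no_fit)
```

The third call shows a combination that does not satisfy the integer constraint, because N'_chan = 64 is not divisible by p = 5.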
+4) Required oversampling factor
+The required oversampling factor depends on the stop band attenuation and stop band bandwidth, and is a trade-off between data rate and processing load. The N_fft = 1024 is a power of 2, so p in r = p/q also has to be a power of two, e.g.:
+32/28 =  8/7  ~= 1.143
+32/27         ~= 1.185  <-- used by ASKAP, LFAA
+32/26 = 16/13 ~= 1.231
+32/25 =       ~= 1.280
+32/24 =  4/3  ~= 1.333
+5) Working of synthesis oversampled filterbank
+Reconstruction from f'_sub (beamlets) or from f'_chan
+Why reconstruct to time series, to separate to new channels?
+Reconstruct the whole band or only a part of the band e.g. 16 MHz for VLBI?
--- a/applications/lofar2/doc/prestudy/station2_opc_ua.txt
+++ b/applications/lofar2/doc/prestudy/station2_opc_ua.txt
+OPC-UA is the IEC 62541 standard
+Large open platform independent standard, but if only a subset of the features is supported, then it becomes less
+standard or platform independent.
+OPC classic = Object Linking and Embedding (OLE) for Process Control
+OPC = Open Platform Communications.  
+OPC-UA = OPC Unified Architecture
+https://opcfoundation.org/
+http://wiki.opcfoundation.org/index.php/UA_Overview
+https://en.wikipedia.org/wiki/OPC_Unified_Architecture
+- Service oriented architecture (SOA) using asynchronous request/response pattern
+- transport: via TCP in binary or web based
+- data model: more than hierarchy of files/folder/registers, object oriented nodes that can send meta information and data
+- expandability via profiles:
+  . DI = device integration
+  . DA = data access
+  . A&C = alarms and conditions
+  . HDA = historical data access
+- security
+- authentication
+Needed:
+- OPC-UA SDK (software development kit)
+  . Considerations regarding Software Development Kits for OPC-UA:
+    http://www.ascolab.com/images/stories/ascolab/doc/ua_whitepaper_implementation_e.pdf 
+    - UA server requires at least ~200 kByte RAM
+  . https://documentation.unified-automation.com/uasdkhp/1.0.0/html/index.html
+- TCP/IP stack
+  . NicheStack (free via Intel)
+    https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/tt/tt_nios2_tcpip.pdf
+  . https://www.micrium.com/rtos/tcpip/
+  . Lightweight IP for UDP about 5.1 Mbps transmit and 3.4 Mbps receive using NiosII/f at 50 MHz, so less for TCP
+    https://www.ee.ryerson.ca/~courses/coe718/Data-Sheets/RTOS/tt_nios2_lwip_tutorial.pdf
+- RTOS (realtime operating system)
+  . MicroC/OS-II: https://www.micrium.com/ (needed with NicheStack, requires license from Micrium)
--- a/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_dsp.txt
+*******************************************************************************
+* Beamformer
+*******************************************************************************
+M&C:
+* BF weights per PN: 
+  - N_pol * S_pn * S_sub_bf * W_bf_weight * N_complex / W_byte = 2 * 12 * 488 * 16 * 2 / 8 = 46848 byte ~= 48 kByte
+  - N_pol * S_pn = 2 * 12 = 24 times S_sub_bf = 488 complex weights
+  - These weights can be sent in 24 packets with 1952 octets/packet
+  - Arria10 has 2713 BRAM of M20k = 2 kByte, so the BF weights use 24 BRAM
+  - BF weight memory options
+    . single buffer -->
+       - BF weights are applied immediately when written,
+       - SCU can send BF weights at arbitrary intervals,
+       - SCU must send BF weights at the time for which they were calculated,
+       - BF weights update rate must be high enough such that they change smoothly
+    . double buffer switch at PPS 
+       - BF weights are applied at next PPS,
+       - SCU must send BF weights in the preceding second
+    . double buffer switch at BSN timestamp
+       - BF weights are applied at scheduled timestamp or immediately if the timestamp is in the past,
+       - SCU can send BF weights at arbitrary intervals,
+       - SCU can send BF weights in advance within the current update interval.
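The BF weight memory sizes above can be recomputed from the parameters. This is a minimal sketch: the constant names mirror the symbols in the text, and the BRAM capacity of 2048 octets is an assumption that follows the rounded "M20k = 2 kByte" used above.

```python
N_POL = 2           # polarizations
S_PN = 12           # signal inputs per PN
S_SUB_BF = 488      # subbands per beamlet set
W_BF_WEIGHT = 16    # bits per real or imaginary part of a weight
N_COMPLEX = 2
W_BYTE = 8
BRAM_OCTETS = 2048  # assumption: one M20k counted as 2 kByte, as in the text

octets_per_set = S_SUB_BF * W_BF_WEIGHT * N_COMPLEX // W_BYTE   # one packet of weights
n_sets = N_POL * S_PN                                           # weight sets per PN
total_octets = n_sets * octets_per_set

brams_per_set = -(-octets_per_set // BRAM_OCTETS)               # ceiling division
n_bram_single = n_sets * brams_per_set                          # single buffer
n_bram_double = 2 * n_bram_single                               # double buffer options

print(f"{n_sets} packets of {octets_per_set} octets = {total_octets} octets")
print(f"BRAM: {n_bram_single} single buffered, {n_bram_double} double buffered")
```

Each set of 488 complex weights fits in one BRAM, so the double buffer options cost twice the BRAM count of the single buffer option.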
+- Subband weights and BF weights design decision:
+  . General Jones matrix operation:
+    |wx cx|   |x|   |wx*x + cx * y|
+    |cy wy| * |y| = |wy*y + cy * x|
+  . Requirement [LOFAR2-3098] states that Station beams have to be independent per polarization. Therefore
+    wx /= wy allows making independent X and Y beams. Otherwise wx, wy could have had the same value, because X
+    and Y are at same location and subband calibration is done separately.
+  . cx, cy can be 0 because no polarization correction per element is needed:
+    |wx  0|   |x|   |wx*x|
+    | 0 wy| * |y| = |wy*y|
+  . Using cx = wx and cy = wy and wx /= wy allows making two independent unpolarized beams using all antenna elements:
+    |wx  0|   |1 1|   |x|   |wx wx|   |x|   |wx * (x + y)|
+    | 0 wy| * |1 1| * |y| = |wy wy| * |y| = |wy * (x + y)|
+    The (x+y) could be implemented as first (x+y) and then *w, or as first weight and then add. 
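The two special cases of the weight matrix can be checked numerically. This is a sketch with arbitrary example values for x, y, wx and wy; the helper name is hypothetical.

```python
def apply_weights(w, v):
    """Multiply a 2x2 weight matrix w (nested lists) with a 2-element vector v."""
    return [w[0][0] * v[0] + w[0][1] * v[1],
            w[1][0] * v[0] + w[1][1] * v[1]]


x, y = 1 + 2j, 3 - 1j            # example X and Y polarization samples
wx, wy = 0.5 + 0.5j, 2 - 1j      # example beamformer weights with wx /= wy

# cx = cy = 0: independent X and Y beams
independent = apply_weights([[wx, 0], [0, wy]], [x, y])

# cx = wx and cy = wy: two unpolarized beams that both use x + y
unpolarized = apply_weights([[wx, wx], [wy, wy]], [x, y])

print("independent:", independent)
print("unpolarized:", unpolarized)
```

The unpolarized result equals wx * (x + y) and wy * (x + y), so adding first and weighting afterwards needs half the number of multiplications.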
+*******************************************************************************
+* Subband correlator
+*******************************************************************************
+*******************************************************************************
+* Transient buffer
+*******************************************************************************
+*******************************************************************************
+* Transient detection
+*******************************************************************************
+*******************************************************************************
+* Subband offload
+*******************************************************************************
\ No newline at end of file
--- a/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_firmware_design.txt
+*******************************************************************************
+* Detailed Design Document of the LOFAR 2.0 Station SDP firmware
+*******************************************************************************
+? Link with functions in ADD
+? Link with L4 requirements on SDP
+? Link with ICDs (what is described in ICD and what in this DD):
+  * L2-ICD 11207 : RCU2S-SDP (JESD204B)
+  * L2-ICD 11209 : STF-SDP (SYSCLK / SYSREF and 200MHz / PPS)
+  * L2-ICD 11211 : SC-SDP (1GbE, Gemini M&C, create MM registers ICD from YAML files with ARGS)
+  * L2-ICD 11218 : SDP-STCA (no firmware interface)
+  * L1-ICD 11109 : STAT.SDP-CEP (beamlets, transient data read out)
+  * L1-ICD 11108 : STAT.SDP-NW (PHY, ARP, ping, XON/XOFF pause frames, no DHCP)
+? Oversampled subband filterbank first needs modelling design
+Title: Detailed Design of the LOFAR 2.0 Station Digital Processing (SDP) Firmware
+Table of contents
+References
+Terminology
+Definitions
+Introduction
+- Context
+  . ADD fig 3.1-1 (E)ICD and L3 PBS overview
+- Scope
+- Document overview
+Station overview
+  . ADD fig 4.1.1-1 M&C SCU -- PCC -- Unb2
+  . ADD fig 4.5.1.2-1 UniBoard2 with 4 PN
+  . ADD fig 4.5.2-1 Firmware toplevel with ICDs
+  . ADD fig 4.5.2-2 External FPGA interfaces for M&C and data offload
+Hardware architecture (SDP, STCA)
+  . Two UniBoard2 per subrack, one PCC, 32 RCU each with 3 signal inputs (ADCs)
+  . 12 ADC per FPGA, 48 ADC per UniBoard, 96 ADC per subrack
+  . LBA ring : two subracks
+  . HBA ring : one subrack for core (two sub-arrays, but one ring to have subband correlations for all)
+               one subrack for remote
+               two subracks for international
+Firmware infrastructure
+  . BSP (unb2_minimal_gmi)
+    - Clock, reset, PPS, flash, fpga regmap info from YAML
+    - MM bus and ARGS
+    - Gemini M&C protocol (impact of AXI MM and ST)
+  . FPGA interface test designs
+    - M&C using 1GbE (unb2_minimal_gmi)
+    - ADC using JESD204B (unb2_test_adc)
+    - QSFP using 10GbE (unb2_test_qsfp)
+    - Ring using 10GbE (unb2_test_ring)
+    - DDR4 (unb2_test_ddr4)
+  . Board test design
+    - All interfaces (unb2_pinning, unb2_test)
+  . Clock domains
+    - ~50 -100 MHz M&C
+    - 200M ADC, 160M ADC
+    - > 200 MHz for processing to fit S_sub_bf = 488 or even 512?, and to prepare for R_os ~=1.25, f_max Arria10?
+    - transceivers, DDR4
+  . Firmware development
+    - RadioHDL
+    - Revisions
+    - Technology wrappers, component libraries and application libraries
+    - M&C software
+    - Coding style (constants package derived from parameters in doc)
+Firmware architecture
+  . Application overview  (array notation of interfaces and packets, ...)
+    - ADC ingress and time stamp
+    - Subband filterbank (critically sampled)
+    - Subband filterbank (oversampled)
+    - Beamformer
+    - Subband correlator
+    - Transient buffer (DDR4 interface, subband select and DM >= 0, packet format, M&C, RW access via M&C) 
+    - Transient detection
+    - Subband offload
+  . Timing (how it is used, sync interval, PPS event, BSN scheduler)
+  . Quantization (where and how)
+  . Resource usage
+  . Debug, test and monitoring points (test functionality)
+    - BSN monitor
+    - Latency monitor
+    - FIFO fill monitor
+    - 1GbE, 10GbE statistics
+    - DDR4 CRC error counts
+    - Data buffer at signal input, beamlet output
+Prototyping:
+- FPGA - ADC JESD204B links (test board with Unb2b, one to S_pn = 12 inputs coax splitter)
+- FPGA - PC 10GbE link stress tests (pause frames, ARP, data rate)
+Designs:
+- unb2c_minimal_gmi
+References:
+- Preliminary design txt files:
+  . station2_sdp_m_and_c.txt        : Monitoring and control, Gemini protocol
+  . station2_sdp_timing.txt         : Station BSN, timestamp definition, BSN aligner
+  . station2_sdp_ring.txt           : ring access, packets for beamlets, crosslets, subbands, TB readout
+  . station2_sdp_dsp.txt            : beamformer, subband correlator, transient buffer, transient detection, subband offload
+  . station2_sdp_icd.txt            : ICD
+  . station2_sdp_hdl_components.txt : rework existing HDL components for LOFAR2.0
+  . station2_sdp_hdl_article.txt    : reference article on RTL design using RL = 0, state and pipelining, AXI4 streaming
+- Other:
+  . tools/oneclick/doc/desp_firmware_dag_erko.txt
+  . tools/oneclick/doc/desp_firmware_overview.txt
\ No newline at end of file
--- a/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_firmware_planning.txt
+*******************************************************************************
+* SDP Firmware planning
+*******************************************************************************
+Includes design, implementation, verification on HW, technical commissioning.
+v1  v2 
+       Infrastructure
+10  20   - Development environment using GIT, RadioHDL, updating existing components
+20   .   - BSP using Gemini Protocol, ARGS
+10   .   - Ethernet access (OSI 1-4)
+10  20   - Ring access
+       Applications:
+15   .   - ADC ingress and time stamp
+20  10   - Subband filterbank (critically sampled)
+ 0  30   - Subband filterbank (oversampled)
+10   .   - Beamformer
+20   .   - Subband correlator
+25   .   - Transient buffer (DDR4 interface, subband select and DM >= 0, packet format, M&C, RW access via M&C) 
+20   .   - Transient detection
+20   .   - Subband offload
+ 0   .   - 160 MHz
+35   . Integration
+     5   - FPGA pinning
+    10   - Interface test designs unb2c
+     5   - Design revisions and lab tests
+    15   - Technical commissioning
1 week = 100% project allocation, gross 40 hours, net 40 * 0.8 = 32 hours = 4 days
sprint = 100% project allocation, gross  3 weeks, net 12 days
v1 : 10 + 20 + 10 + 10 + 15 + 20 + 10 + 20 + 25 + 20 + 20 + 35 = 215 gross weeks --> 215 / 40 = 5.4 FTE ~ 3 people each 2 years
+v2 : 10 less for critically sampled PFB
+     10 more for updating existing components
+     10 more for ring access
+     30 for oversampled PFB
+      . consider unb2c test part of SDP FW integration and of SDP HW
     15 technical commissioning relies on proper Systems Engineering, otherwise may become 50 weeks
+==> EK, JH: v1 estimate of April 2019 is still valid as v2 on 10 Oct 2019.
+v3 : 
+   Infrastructure
+20   - Development environment using GIT, RadioHDL, updating existing components
+ 5   - unb2c FPGA pinning
+10   - unb2c FPGA interface test designs
+20   - Board Support Package using Gemini Protocol and ARGS
+20   - Ring access
+10   - 10GbE access (OSI 1-4)
+   Applications:
+15   - ADC input and time stamp
+10   - Subband filterbank (critically sampled)
+20   - Subband correlator
+10   - Beamformer
+25   - Transient buffer
+20   - Subband offload for AARTFAAC
+20   - Transient detection
+30   - Oversampled subband filterbank
+ 0   - Support 160 MHz
+   Integration:
+10   - Lab tests
+ 5   - Technical commissioning Dwingeloo
+ 5   - Technical commissioning Prototype Station
+All:
+20 + 5 + 10 + 20 + 20 + 10 + 15 + 10 + 20 + 10 + 25 + 20 + 20 + 30 +  0 + 10 + 5 + 5 = 255
+No oversampled filterbank:
+20 + 5 + 10 + 20 + 20 + 10 + 15 + 10 + 20 + 10 + 25 + 20 + 20 +       0 + 10 + 5 + 5 = 225
--- a/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_hdl_components.txt
--- a/applications/lofar2/doc/prestudy/station2_sdp_icd.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_icd.txt
+ICD interface types:
+    m - Mechanical (structural, loading, tooling, etc)
+    f - Fluid (pneumatic, cooling, heating, condensate, fuels, lubricants, waste, exhaust, feedstocks etc)
+    t - Thermal (cooling, heating, heatsinking, etc )
+    em - Electromagnetic (DC field, RF, etc)
+    o - Optical (numerical aperture, focal position, etc)
    p - Electrical (i.e. conducted power)
+    e - Electronic (i.e. conducted signals or data)
+    eo - Electro-optical (generally signals or data)
+    d - Data exchange specifications (protocol stack)
+    h - Human-Machine Interface (special combination of some of the above) 
+UDP link control
+- flow control = end-to-end
+- congestion control = peer-to-peer within the network
+  . reliable transmission, at fair rate, with high resource utilization
+  . implemented in network layer
+  . also called transport protocol --> TCP ++, UDP -- (selfish protocol, low delay)
+- ARP
+  . Tx ARP request
+- UDP/IPv4
+  . UDP checksum (not used in LOFAR1)
+LFAA-CSP_Low : OSI (Open Systems Interconnection) layers
+7 Application  : Not applicable, this is the level where the STAT and CEP products each perform their allocated functions.
+6 Presentation :
+  - SPEAD header
+    header first word:
+      magic = 0x53 ='S' 8b, version = 0x4 8b, itemPointerWidth = 0x2 8b, HeapAddrWidth = 0x0 8b, rsvd=0 16b, number of items = 0x8 16b
+    header items:
+      heap_counter     = coarse channel number (1-511) 16b, packet counter 32b # restart at 0 for new observation, 2k samples per packet --> packet counter wraps after few days
+      pkt_len          = packet payload length 48b
      sync_time        = unix_epoch_time [s] 48b # last time system was synchronised by PPS in seconds since 1 Jan 1970
+      timestamp        = timestamp [ns] 48b      # time of center of first sample in packet since sync_time in ADC sample periods of 1.25 ns
+      center_freq      = frequency [Hz] 48b      # center frequency of coarse channel (1-511) * 781250 in Hz
+      csp_channel_info = rsvd 16b, beam_id 16b, freq_id 16b
+      csp_antenna_info = substation_id (1-512) 8b, subarray_id (1-16) 8b, station_id 16b, nof_contributing_antenna (typ. 256) 16b
+      sample_offset    = payload_offset = 0x0
+    data
+      - 1 beam, 1 coarse channel
+      - sampling period is 1.25 ns * 1024 * 27/32 = 1080 ns
+      - 8 bit complex coarse channel samples
+      - Xre, Xim, Yre, Yim = 32b
+      - samples are in strict time order
+      - 2's complement
+      - most negative value -128 indicates error
+5 Session   : Controls connections (start, manage, terminate)
+  - SPEAD header 
+4 Transport : Flow control, error recovery, retransmission
+  - UDP [RFC 768]
+  - The peak data rate on a link shall be no more than 20% (TBC) above the average data rate
+3 Network   : addressing, routing
+  -  IPv4 Internet Protocol
+2 Data link : link between two nodes
+  - Ethernet standard [IEEE Std 802.3-2015], 40 GbE
+1 Physical  :
+  - Ethernet standard [IEEE Std 802.3-2015], 40 GbE
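The first SPEAD header word and the derived quantities listed under layer 6 can be written out as follows. This is a sketch: the field widths and values are the ones listed above, while the big endian byte order and the helper names are assumptions for illustration.

```python
import struct
from fractions import Fraction

# First header word: 4 fields of 8 bit, then 16 bit reserved and 16 bit number of items.
# Assumption: fields are packed in the listed order, most significant byte first.
first_word = struct.pack(">BBBBHH",
                         0x53,   # magic 'S'
                         0x4,    # version
                         0x2,    # itemPointerWidth
                         0x0,    # HeapAddrWidth
                         0,      # reserved
                         0x8)    # number of items

# Coarse channel sample period: ADC period * FFT size * 27/32 (oversampling)
T_ADC_NS = Fraction(5, 4)                               # 1.25 ns
sample_period_ns = T_ADC_NS * 1024 * Fraction(27, 32)


def center_freq_hz(coarse_channel):
    """Center frequency of a coarse channel (1-511) on the 781250 Hz grid."""
    return coarse_channel * 781250


print(first_word.hex())
print(sample_period_ns, "ns")
print(center_freq_hz(1), "Hz")
```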
+L1 ICD 11109 : STAT - CEP
+ . Beamlet data
+ . Transient buffer read out
+ . Subband offload (for AARTFAAC)
+STAT-CEP Beamlet data interface:
+- VERSION_ID 8b
+  . 2,3,4 for LOFAR1
+  . 5 first for LOFAR2.0
+- SOURCE_INFO 16b
+  . 2b Array ID (core station 1 LBA, 2 HBA, ...)
+  . 1b f_adc = 200 MHz, 160 MHz
+  . 1b critically PFB, oversampled PFB (or p, q for R_os = p/q)
+  . 4b beamlet width in number of bits (default 8 for W_beamlet = 8 bit, instead of BM = beamlet mode)
+  . 5b UniBoard2 FPGA id (16 FPGAs for LBA, 16 for HBA in International Station, instead of RSP ID)
+- CONFIGURATION_ID 8b (used in LOFAR1? intended to refer to the parset that defines this observation)
+- STATION_ID 16b (idem as LOFAR1)
+- One packet per range of Station beamlets out of 488 beamlets
+  . Full band : S_sub_bf * W_beamlet * N_complex / W_byte = 488 * 8b * 2 / 8b = 976 octets
+  . NOF_BEAMLETS_PER_BANK not needed anymore
+  . nof_streams = Number of beamlet streams
+    - Separate destination address per stream
+    - LOFAR1 supports 4 streams
+    - LOFAR2.0 preferably supports >> 4 streams
+      - beamlet_id to identify start beamlet in stream (provides more info than a stream ID)
+      - NOF_BEAMLETS_PER_BLOCK to identify range of beamlets from beamlet_id
+      - LOFAR1: beamlet_id = 0 and NOF_BEAMLETS_PER_BLOCK = 61 (dual pol beamlets, 4 streams):
+- NOF_BLOCKS 16b in payload
+  . Multiple beamlet time slots in one packet to increase payload efficiency.
+  . For W_beamlet = 8 bit there can be maximum 9 blocks per payload (9 * 976 = 8784 octets < 9000)
+  . With nof_streams >> 4 the NOF_BLOCKS can become larger, therefore use 16b. For example:
+    - NOF_BEAMLETS_PER_BLOCK = S_sub_bf / nof_streams = 488 / 32 = 15.25, so 16
+    - NOF_BEAMLETS_PER_BLOCK * W_beamlet * N_complex / W_byte = 16 * 8b * 2 / 8b = 32 octets
+    - 9000 / 32 = 281 > 256 --> use 16b for NOF_BLOCKS
+    - nof_streams = 22 destination nodes, each with 8k Byte payload, possibly a double buffer:
+      22 * 8 kByte * 2 = 352 kByte = 176 BRAM (1 BRAM = 2 kByte, FPGA has 2713 BRAM)
+    - 488 / 22 = 22.18, so 488 = 4 * 23 + 18 * 22
+  . Only send correct data to CEP (so no need for SOURCE_INFO/payload error bit).
+  . How to handle blocks that got lost within the Station?
+- TIMESTAMP 64b (instead of 32b seconds TIMESTAMP and 32b BLOCK_SEQUENCE_NUMBER within second)
+  . A 64 bit timestamp in 0.2 ns resolution since t_base = 1970 for first block in payload:
+    - to fit both T_adc = 5 ns and 6.4 ns
+    - for 116 year span since t_base = 1970 --> 2086
+- BLOCK_PERIOD 16b
+  . 16 bit block period in 0.2 ns resolution
+  . 2**16 * 0.2 ns = 13.1 us block period (block rate > 76 kHz) fits T_sub
+- BSN 64b
+  . Block sequence number since t_base = 1970 of first block in payload, increments by 1 for every block
+  . Used to detect lost blocks and to align blocks from different stations
+- TX_PACKET_COUNT 32b
+  . OSI transport layer 4
+  . Per stream
+  . Started at Station power up, increments by 1 for every transmitted packet.
+  . To allow CEP to distinguish packets that got lost on the network from data blocks that got lost
+    in the Station ring or packets that were not sent because the output was disabled.
+  . Only transmit packets that have continuous blocks / allow varying number of blocks per packet
+    in case a block is lost on the ring.
+- Data 
+  . X, Y paired dual polarization beamlets
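The payload sizing arithmetic for NOF_BLOCKS and nof_streams above can be checked with a short Python sketch. The function names and the 9000 octet payload budget used as `MAX_PAYLOAD` are illustrative assumptions, not definitions from the ICD.

```python
# Sketch of the beamlet payload sizing arithmetic (illustrative names, not from the ICD).
S_SUB_BF = 488       # number of Station beamlets
N_COMPLEX = 2        # re and im part per beamlet value
W_BYTE = 8           # bits per octet
MAX_PAYLOAD = 9000   # assumed payload budget in octets (jumbo frame)


def octets_per_block(nof_beamlets, w_beamlet=8):
    """Octets for one time slot (block) of nof_beamlets beamlets."""
    return nof_beamlets * w_beamlet * N_COMPLEX // W_BYTE


def max_blocks(nof_beamlets, w_beamlet=8):
    """Maximum number of blocks that fit in one payload."""
    return MAX_PAYLOAD // octets_per_block(nof_beamlets, w_beamlet)


def split_beamlets(nof_streams, total=S_SUB_BF):
    """Distribute the beamlets over the streams as evenly as possible."""
    base, extra = divmod(total, nof_streams)
    return [base + 1] * extra + [base] * (nof_streams - extra)
```

For the full band `max_blocks(488)` gives 9, for 16 beamlets per block `max_blocks(16)` gives 281, which exceeds 256 and motivates the 16b NOF_BLOCKS field. `split_beamlets(22)` yields 4 streams of 23 and 18 streams of 22 beamlets.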
\ No newline at end of file
--- a/applications/lofar2/doc/prestudy/station2_sdp_m_and_c.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_m_and_c.txt
+*******************************************************************************
+* Station Control software:
+*******************************************************************************
+The Station contains hardware and software devices that deliver the functionality of the application
+[4.1.2.1]. The Station Control software consists of Control and M&C. The Control determines the behaviour
+of the devices in time. Via the M&C the Control can control the devices and monitor them. The M&C uses
+a standard software interface for the Control to access the devices. The Station M&C will
+use OPC-UA as standard M&C access interface for all devices [4.1.2.2]. Only in certain cases can there
+be an exception to not use OPC-UA [4.1.2.3.2].
+The M&C system is an abstraction layer between the high level software of the Control and the low level
+software or firmware in the devices [4.1.2.3]. The M&C will use the master-slave pattern to monitor a
+device, so the device will only provide monitoring information on request and never by itself. In this
+way the control and monitoring traffic are independent. If the device performs a certain task, then it
+may provide a monitoring point that allows the master to monitor the progress. Only for low latency
+events that originate in the device it may be necessary to use the publish-subscribe pattern, whereby
+the slave self-generates an event message.
+*******************************************************************************
+* M&C of SDP firmware
+*******************************************************************************
+For the M&C of the SDP firmware that runs on the array of FPGAs on the UniBoard2s there will be an
+SDP converter/bridge that translates between the FPGA memory map and OPC-UA [4.1.2.3.1]. Using ARGS
+it may be possible to generate the device specific parts of the bridge software, because the number
+of FPGAs and all register fields in the FPGA memory map are known [4.1.2.5.1].
+*******************************************************************************
+* Monitoring interval
+*******************************************************************************
+In LOFAR1 the M&C that is supported by the FPGA firmware has two flavors:
+- Asynchronous (immediate)
+  . C: the data point values are applied upon arrival of the request message.
+  . M: the data point values are reported upon arrival of the request message.
+- Synchronous (fixed at the PPS grid):
+  . C: the data point values are applied in the next PPS period
+  . M: the data point values of the previous period are reported in this PPS period.
+The asynchronous M&C is suitable if the data point value is static or if its precise timing does not 
+have to be more accurate than what the M&C can achieve (order of 10 ms). The synchronous M&C is
+suitable for data point values that need sample period accurate timing within one FPGA or between
+FPGAs in parallel. The synchronous M&C can be for a single PPS instant or for every PPS instant.
+- Use fixed internal sync aligned to PPS
+  . The disadvantage of a fixed interval is that it is inflexible and cannot be controlled by the SCU.
+    This could also be considered an advantage, because it is well defined and does not need control.
+  . With a fixed interval the monitored information may only reflect what happened during the previous
+    period. Therefore if the monitoring has to be without gaps in time then the SCU needs to monitor
+    and aggregate the information at every period. Using a configurable period this aggregation in the 
+    SCU can be avoided.
+  . SCU must read the statistics in the second between two PPS (with some 10 ms margin). This is feasible
+    but a strict grid.
+  . If the SCU reads at arbitrary time, then part of the read values may apply to this second and some
+    to the previous second. For most monitoring this is no problem. If necessary the SCU can wait for
+    PPS and then read the monitoring to ensure that it relates to the same interval on all FPGAs.
+- Use single event BSN timestamp scheduler
+  . Gemini M&C protocol does not have timestamp activated control yet, therefore use separate BSN scheduler
+    control point.
+  . SCU can read the statistics after the scheduled BSN
+  . The next integration lasts until the next scheduled BSN
+  . The programmable interval allows arbitrary integration intervals, which avoids the need for the
+    SCU to integrate 1 s intervals in case longer intervals are needed.
+  . The SCU can then scale the statistics result based on the actual integration period of each
+    measured interval, while the intervals are still all without gaps.
+  . Dependent on the speed of the SCU it can use shorter integration intervals, by scheduling the next
+    BSN as soon as it has finished reading the statistics from the previous interval
+  . The BSN scheduler should also provide a monitoring value for the integration interval, i.e. the
+    number of block periods since the previous scheduled BSN.
+  . If the schedule interval is too long then the statistics and monitoring counts may overflow.
+    The values should then clip and not wrap, to show that they overflowed.
+- Use periodic event timestamp scheduler.
+  . Control: The period interval is defined by a start time and a period time. If the period time is -1
+    then the period scheduler acts as a single event scheduler.
+  . Monitor: The periodic scheduler can report the current time when read, the time of the last event, the
+    time of the next event (or -1 for no scheduled event) and the deltas cur - prev and next - cur.
+  . A periodic event only needs to be setup once by the SCU. The setup can be changed at any time.
+  . The BSN cannot be used directly, because the PPS grid does not always fit the BSN grid. Therefore use
+    the 64 bit timestamp with 0.2 ns resolution to schedule the start time and the period. The event will
+    occur at the BSN slot that is at or directly after the event time.
+  . By default after power up the timestamp scheduler starts at the PPS using the initial
+    BSN. The default period is 1 s, so 5000000000 [0.2 ns]. In this way the periodic scheduler behaves
+    similar to the PPS driven sync interval in LOFAR1.
+  . Using the 64 bit timestamp with 0.2 ns is more clear than using a BSN scheduler with fractional BSN
+    period control
+  . For short integration intervals the SCU may not be able to keep up. It is more robust to allow a
+    short but not necessarily constant integration interval, which is known via the monitoring point.
+    Instead of the periodic scheduler the SCU then schedules a new event after it has finished reading
+    the monitoring data from the previous event.
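The periodic event timestamp scheduler described above can be modelled in a few lines of Python. This is a minimal sketch, assuming integer times in units of 0.2 ns and a period of -1 for a single event. The function names are hypothetical and do not refer to existing firmware or software.

```python
# Minimal model of the periodic event timestamp scheduler (hypothetical names).
# All times are integers in units of 0.2 ns.
T_SUB_I = 25600              # block period for N_blk = 1024 at 200 MHz
PERIOD_1S = 5_000_000_000    # default period of 1 s


def event_bsn(event_time, t_sub_i=T_SUB_I):
    """BSN slot that is at or directly after the event time."""
    return -(-event_time // t_sub_i)   # integer ceil for non-negative times


def next_event_time(cur_time, start_time, period=PERIOD_1S):
    """Time of the next event at or after cur_time, or -1 if none is scheduled."""
    if cur_time <= start_time:
        return start_time
    if period == -1:
        return -1                      # single event that has already occurred
    n = -(-(cur_time - start_time) // period)
    return start_time + n * period
```

With the default period an event at 1 s maps to `event_bsn(PERIOD_1S)` = 195313, because 1 s is 195312.5 block periods, while an event at 2 s falls exactly on BSN 390625.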
+Behaviour of the data points:
+- Asynchronous:
+  . Only clear data points on control write access, so not as side effect of a monitor read access
+- Synchronous:
+  . Dual page data points swap or shift page at a synchronous event, to provide a precisely timed
+    and stable data value that can be written for control before the event or read for monitor after
+    the event.
+- Apertif MM registers
+  . Async :
+    - ETH control and status
+    - WDI
+    - UNB_SENS
+    - COMMON_PULSE_DELAY
+    - ADC_QUAD
+    - FIL_COEFS  
+    - SS_REORDER
+    - DIAGNOSTICS_BACK       counts clear after dedicated write access
+    - TR_NONBONDED_BACK
+    - DP_RAM_FROM_MM
+    - BF_WEIGHTS
+    - DP_PKT_MERGE
+    - DP_SPLIT
+    - DP_SWITCH
+    - DP_SYNC_CHECKER        side effect counts clear when read
+    - DP_BSN_ALIGN_INPUT           
+    - DP_FIFO_FILL                 
+    - DP_XONOFF_OUTPUT             
+    - DP_OFFLOAD_RX_HDR_DAT        
+    - DP_OFFLOAD_TX_HDR_DAT        
+    - DPMM_CTRL                    
+    - DPMM_DATA                    
+    - MMDP_CTRL                    
+    - MMDP_DATA                    
+    - IO_DDR          
+    - DP_XONOFF_OUTPUT 
+    - DP_OFFLOAD_TX        
+    - TR_XAUI         
+    - MDIO_0                         
+    - TR_10GBE        
+    - EPCS                         
+    - REMU                         
+  . Async, restart immediately after last write, or
+    Sync, restart by external sync from PPS, BSN scheduler or PPS after write
+    - I2C master
+    - DIAG_WG
+    - DIAG_BG
+    - DP_SHIFTRAM
+    - BSN_SOURCE
+  . Sync, generate single event at BSN
+    - BSN_SCHEDULER_WG
+  . Sync, single page, periodic event latch value at every sosi.sync
+    - ADUH_MON (mean, sum)
+    - BSN_MONITOR
+  . Sync, single page, periodic event store values at every sosi.sync, or
+    Async store data after last read
+    - ADUH_MON (buffer)
+    - DIAG_DATA_BUFFER
+  . Sync, dual page monitor, periodic event latch sum values and restart integration at every sosi.sync
+    - ST_SST
+  . Sync, dual page control, periodic event page swap at sync when last value was written (so only then swap)
+    - DP_FRINGE_STOP_OFFSET
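The two dual page behaviours listed above (control page swap only after a write, monitor latch and restart at every sync) can be illustrated with a small behavioural model. The class names are hypothetical and the model ignores register widths and clock domain details.

```python
# Behavioural sketch of dual page data points (hypothetical class names).


class DualPageControl:
    """Control: M&C writes the pending page, the sync swaps it in."""

    def __init__(self, initial=0):
        self.active = initial     # page used by the firmware function
        self.pending = initial    # page written via M&C
        self.written = False      # set when a new value was written

    def write(self, value):
        self.pending = value
        self.written = True

    def sync(self):
        # Swap only if a new value was written since the previous swap
        if self.written:
            self.active = self.pending
            self.written = False


class DualPageMonitor:
    """Monitor: integrate in one page, latch the sum at sync and restart."""

    def __init__(self):
        self.accum = 0
        self.latched = 0

    def add(self, power):
        self.accum += power

    def sync(self):
        self.latched = self.accum   # stable value for M&C during next interval
        self.accum = 0              # restart the integration

    def read(self):
        return self.latched
```

In this model a value written for control only becomes active at the next sync, and a monitor read always returns the sum of the previous complete interval, which is the behaviour listed for DP_FRINGE_STOP_OFFSET and ST_SST.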
--- a/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_ring.txt
--- a/applications/lofar2/doc/prestudy/station2_sdp_timing.txt
+++ b/applications/lofar2/doc/prestudy/station2_sdp_timing.txt
+*******************************************************************************
+* Fixed Station BSN grid and the PPS grid
+*******************************************************************************
+The Station needs an external trigger to align all ADCs in the RCU2S and all FPGA processing nodes in the SDP.
+For this trigger a pulse from the pulse per second (PPS) is used. The PPS is aligned to the top of second of the
+UTC time of day (ToD). The PPS is a hardware trigger that is available within the entire SDP at sample clock
+cycle accuracy. Thanks to the Timing Distributor (TD) the PPS trigger is also available as hardware trigger in
+all Stations, aligned to the UTC ToD that is available to the Telescope Manager
+(TM) in LOFAR2.0 and to Station Control in each Station. The TM controls, via Station Control, which PPS pulse
+is used to start SDP. The PPS is identified by a Seconds Sequence Number (SSN) that counts PPS since a certain
+date in the past, e.g. t_epoch = 1 Jan 1970, but some other fixed date is possible too.
+The SDP processes the data in blocks of ADC samples that are identified by a Station block sequence number
+(BSN). The Station BSN time grid should be fixed, so independent of when the data processing starts. Therefore
+the Station BSN counts blocks since the same t_epoch as the SSN, so the t_epoch defines the common reference
+moment in history for the Station BSN grid and for the PPS grid. The PPS grid does not necessarily always
+coincide with the Station BSN grid. The BSN period determines whether the Station BSN can start exactly at a
+PPS or not. To be able to start the data processing at any PPS it is necessary that the Station BSN source
+can start at a programmable fraction of a BSN period after the PPS. In this way processing of one Station ADC
+signal input can be restarted at any PPS (with zero phase offset to the other signal input) and an entire
+Station can be restarted at any PPS (with zero delay offset to the other Stations). The Station BSN source
+ensures that the BSN timing is always on the fixed Station BSN time grid.
+The sample frequency f_adc = 200 MHz is an integer number of Hz and locked to the PPS, therefore the PPS grid
+always coincides with the ADC sample period T_adc = 1/f_adc = 5 ns grid. The Station BSN block period is an
+integer number of N_blk sample periods T_adc. The Station BSN period is set by the subband rate of the subband
+polyphase filterbank (PFB). For the critically sampled PFB the BSN period is N_blk = N_fft = 1024 [T_adc].
+For an oversampled PFB N_blk = N_fft / R_os [T_adc], so e.g. N_blk = 864 for R_os = 32/27. In both cases the
+Station BSN period is equal to the subband period T_sub. The offset between the PPS grid and the BSN grid is
+between 0 and N_blk-1 ADC sample periods. Hence to start the processing at any PPS the Station BSN source has
+to be able to start the Station BSN at an offset of 0:N_blk-1 sample periods. The Station BSN source starts at
+this PPS with an initial Station BSN of:
+  initial Station BSN = ceil((SSN * 1 s) / T_sub) = ceil((SSN * f_adc) / N_blk)
+and a BSN fractional offset of:
+  BSN fractional offset = mod(SSN * 1 s, T_sub) / T_adc = mod(SSN * f_adc, N_blk)
+to make sure that the BSN grid is always relative to t_epoch, independent of at which PPS the BSN source was
+started. The Station BSN increments after every block. The time t at the BSN grid is:
+  t = t_epoch + Station BSN * T_sub
+Note: The BSN fractional offset could also be compensated for by delaying the sample data in the signal input
+buffers. Delaying the data does compensate for phase differences in the subband data, but does not compensate
+for the offset in the BSN grid. The BSN alignment buffers between different inputs then still need to compensate
+for this BSN fractional offset. Hence delaying the data is an indirect and incomplete solution, and therefore it
+is not used.
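The two formulas for the initial Station BSN and the BSN fractional offset can be evaluated with integer arithmetic. The sketch below follows the formulas as written, with t_epoch at SSN = 0 and f_adc = 200 MHz. The helper `samples_until_block_start` is an addition of this sketch, not a quantity defined in the text: it gives the number of sample periods from the PPS to the start of the block with the initial BSN.

```python
# Sketch of the Station BSN start calculation (function names are illustrative).
F_ADC = 200_000_000   # ADC sample rate in Hz, locked to the PPS


def bsn_start(ssn, n_blk, f_adc=F_ADC):
    """Return (initial Station BSN, BSN fractional offset) at PPS number ssn."""
    samples = ssn * f_adc                  # sample periods since t_epoch
    initial_bsn = -(-samples // n_blk)     # ceil(SSN * f_adc / N_blk)
    frac_offset = samples % n_blk          # mod(SSN * f_adc, N_blk)
    return initial_bsn, frac_offset


def samples_until_block_start(ssn, n_blk, f_adc=F_ADC):
    """Sample periods between the PPS and the start of the initial BSN block."""
    initial_bsn, _ = bsn_start(ssn, n_blk, f_adc)
    return initial_bsn * n_blk - ssn * f_adc
```

For N_blk = 1024 the offset alternates between 0 at even SSN and 512 at odd SSN. For N_blk = 864 at SSN = 1 the formulas give an initial BSN of 231482 and a fractional offset of 416, with 448 sample periods remaining until that block starts.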
+*******************************************************************************
+* Relation between Station BSN and timestamp in fractional seconds
+*******************************************************************************
+The T_sub and the fixed BSN grid provide sufficient timing accuracy to timestamp the data and any
+M&C upon the data, because:
+- It is not necessary to facilitate using an offset 0 < T_sub_o < T_sub to start the BSN grid at an integer
+  number of T_adc after t_epoch, because the BSN grid is sufficiently fine.
+- It is not necessary to represent fine group delays of digital filters or analogue electronics and 
+  cables in the BSN, because these are all fixed after calibration. Coarse group delays and cable delay
+  differences can be compensated for in steps of T_adc via the signal input buffer of every ADC input in
+  SDP. Fine group delay differences within a Station can be calibrated via the subband calibration weights
+  and group delay differences between Stations need to be calibrated at CEP.
+- For the PFB it does not matter at which Station BSN period it was started. What matters is that the PFB
+  for all signal inputs (within a Station and between Stations) is started at the same fixed Station BSN
+  grid, because this ensures that the output from all signal inputs will not get relative phase offsets.
+The resolution T_sub of the Station BSN in LOFAR2 depends on the ADC sample frequency and on the
+subband filterbank:
+  N_blk  T_adc    T_sub       T_sub_i
+  1024 * 5   ns = 5120   ns = 25600 [0.2 ns] for critically sampled filterbank at 200 MHz
+  1024 * 6.4 ns = 6553.6 ns = 32768 [0.2 ns] for critically sampled filterbank at 160 MHz
+   864 * 5   ns = 4320   ns = 21600 [0.2 ns] for oversampled filterbank R_os = 32/27 = 1.185 at 200 MHz
+   864 * 6.4 ns = 5529.6 ns = 27648 [0.2 ns] for oversampled filterbank R_os = 32/27 = 1.185 at 160 MHz
+   800 * 5   ns = 4000   ns = 20000 [0.2 ns] for oversampled filterbank R_os = 32/25 = 1.28 at 200 MHz
+   800 * 6.4 ns = 5120   ns = 25600 [0.2 ns] for oversampled filterbank R_os = 32/25 = 1.28 at 160 MHz
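The T_sub_i column of the table can be reproduced with exact rational arithmetic. In this sketch the T_adc values are taken as listed in the table (5 ns and 6.4 ns) and are not derived from f_adc; note that 1 / 160 MHz would evaluate to 6.25 ns, so the 6.4 ns entry is an input assumption here. The function name is illustrative.

```python
# Reproduce T_sub_i in units of 0.2 ns from N_blk and T_adc as listed in the table.
from fractions import Fraction

RESOLUTION_NS = Fraction(1, 5)   # 0.2 ns timestamp resolution


def t_sub_i(n_blk, t_adc_ns):
    """Block period in integer units of 0.2 ns; t_adc_ns is a decimal string."""
    t_sub = n_blk * Fraction(t_adc_ns)     # block period in ns, exact
    units = t_sub / RESOLUTION_NS
    if units.denominator != 1:
        raise ValueError("T_sub is not an integer number of 0.2 ns units")
    return int(units)
```

For example `t_sub_i(864, "6.4")` returns 27648, and all six table rows come out as integers, which is the property that makes 0.2 ns usable as a common timestamp unit for the listed sample periods.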
+In LOFAR2 the timestamp should be independent of:
+- using 200 MHz sample rate or 160 MHz sample rate
+- using critically sampled filterbank or oversampled filterbank
+If T_sub was fixed then T_sub could be used as timestamp resolution (like in APERTIF). However T_sub depends
+on the subband filterbank with a resolution of T_adc. If T_adc was fixed then T_adc could be used as timestamp
+resolution. However T_adc depends on the sample clock rate. Therefore the timestamp resolution needs to be
+as fine as the greatest common divisor of T_adc = 5 ns and T_adc = 6.4 ns, which is 0.2 ns.
+A 64 bit timestamp with 0.2 ns resolution can count 2**64 / (365.25 * 24 * 3600 / 0.2e-9) = 116 years. Hence
+for t_epoch = 1970 this is until 2086, which is sufficient for the lifetime of LOFAR2. Internally in SDP
+firmware use the BSN to count T_sub. Externally at the SDP interface use timestamp values with a resolution
+of 0.2 ns such that they are:
+ * integer values, and
+ * independent of the sample period.
+The actual timestamp in fractional seconds of 0.2 ns follows from:
+  timestamp = Station BSN * T_sub_i * 0.2 [ns].
+The BSN and T_sub_i can be specified as:
+- single 64 bit integer timestamp value of BSN * T_sub_i [0.2 ns]
+- two separate fields with an incrementing BSN and resolution given by T_sub_i [0.2 ns]
+To cover 116 years for a BSN with smallest T_sub = 4000 ns for R_os = 32/25 = 1.28 requires:
+  log2( 116 * (365.25 * 24 * 3600 / 4000e-9) ) = 49.7, so 50 bits
+Therefore allocate 64b in a packet header to send the BSN information. The BSN and timestamp are directly
+related via T_sub_i, but the advantage of providing the BSN separately is that it increments by 1 for
+each block period T_sub, so it can be used as block index.
+The range of T_sub is 4000 ns - 6553.6 ns, so the range of T_sub_i is 20000 - 32768. These T_sub_i values
+can be covered in a 16 bit number. Alternatively T_sub_i can be derived from the four possible
+combinations of f_adc = 200M or 160M and R_os = 1 or 32/25, that can be represented with 2 bits.
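The time span and field width figures used in this section follow from two small formulas, sketched here in Python. The function names are illustrative and a year is taken as 365.25 days, as in the text.

```python
# Check of the timestamp span and BSN width estimates (illustrative names).
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def span_years(nof_bits, resolution_s):
    """Years covered by an unsigned counter of nof_bits with the given resolution."""
    return 2**nof_bits * resolution_s / SECONDS_PER_YEAR


def bits_needed(years, period_s):
    """Counter width needed to count periods of period_s for the given years."""
    return math.ceil(math.log2(years * SECONDS_PER_YEAR / period_s))
```

`span_years(64, 0.2e-9)` evaluates to just under 117 years, and `bits_needed(116, 4000e-9)` returns 50, in line with the 116 year span and the 50 bit BSN estimate above.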
+* SDO
+RSP uses rad_bsn to calculate a continuous BSN from the seconds timestamp, odd and even second and local BSN. The
+PPS timestamp is captured by the stream sync (so the sync must not have gone lost). The local BSN is derived by
+counting eop and restarting at the sync (so not by an input FSN, and therefore lost packets will cause a
+wrong local BSN). The BSN on UniBoard is thus a continuous BSN that counts since some PPS in the past,
+defined by the seconds timestamp. The RSP clock frequency is represented by a bit (0 = 160 MHz, 1 = 200 MHz).
+* Continuous Station BSN
+In APERTIF the BSN is a continuous BSN that can start at any PPS, but without support for a BSN fractional
+offset. 
+*******************************************************************************
+* Internal sync interval
+*******************************************************************************
+The BSN source can provide a periodic sync pulse that is passed along with the data, throughout all intermediate
+transport and processing functions within the FPGA. The purpose of the sync pulse can be:
+- To pass along the Station BSN from the central BSN source in the FPGA directly to where it is needed in
+  the FPGA, without having to pass along the 64 bit Station BSN value itself. For the blocks between sync
+  pulses Station BSN is incremented with every block. This implies that during the sync interval lost blocks
+  must be replaced by filler blocks to preserve the BSN derivation. If lost blocks cannot be replaced, then
+  one lost block causes the remaining blocks in the sync interval to be discarded too.
+- To provide a timing grid at which downstream functions can start or resynchronise in case of lost data. 
+- To provide a fixed update interval for various purposes. In LOFAR1 and APERTIF the sync period is used as 
+  fixed update interval for periodic monitoring, periodic control (the beamformer weights) and periodic 
+  integration intervals (AST, SST, BST and XST). The advantage of a fixed update interval is that it is well
+  defined and does not need control, the disadvantage can be that it is inflexible.
+- To avoid having to derive the sync interval from the BSN where needed in the FPGA. This would require the
+  BSN to be passed along in the FPGA. Deriving the sync interval from the BSN is awkward if the sync interval
+  is not a power of 2 number of blocks.
+Assumptions:
+- Within an FPGA there occur no logic errors
+- Within an FPGA no packets get lost
+- Within an FPGA only correct packets are processed
+- Packets that are received with a CRC error are discarded
+- Packets that are lost or discarded remain discarded or are replaced by a filler packet (TBD)
+  . If at least one input of a BSN aligner is still active, then the other inputs can be created if they
+    are not active.
+  . If all inputs of a BSN aligner are inactive, then no output should be created. This can occur with
+    the ring in LOFAR or mesh in APERTIF BF if the BSN source has not been started yet. In APERTIF XC
+    this can occur if all dishes have no output.
+  . If the output has stopped because all inputs went inactive, then the output should only resume
+    at a sync (TBD).
+  . The BSN source can be regarded as a BSN aligner with one input and a local reference.
+  . If lost or discarded data is not replaced, then the packets need to keep their BSN number, because
+    the BSN can then not be derived by locally counting packets within a sync interval. This is not a
+    problem for integration, but it is for CEP data output.
+  . If the received sync does not coincide with the local reference sync then something unexpected has
+    happened.
+  . If no data is lost then a valid and eop are sufficient to mark all blocks. The first valid and the
+    valids after each eop then identify the start of block (sop). Counting eop (or sop) and adding the
+    initial BSN yield the Station BSN.
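The last assumption, that valid and eop are sufficient to mark all blocks when no data is lost, can be illustrated with a cycle based model. This is a sketch with a hypothetical function name: it derives the sop and the Station BSN per clock cycle from lists of valid and eop strobes.

```python
# Derive sop and Station BSN from valid and eop, assuming no blocks are lost.


def mark_blocks(valid, eop, initial_bsn):
    """Return a list of (sop, bsn) per clock cycle; bsn is None when not valid."""
    out = []
    bsn = initial_bsn
    expect_sop = True                  # the first valid starts a block
    for v, e in zip(valid, eop):
        sop = bool(v and expect_sop)
        out.append((sop, bsn if v else None))
        if sop:
            expect_sop = False
        if v and e:
            expect_sop = True          # the first valid after an eop is a sop
            bsn += 1                   # counting eop increments the BSN
    return out
```

For two blocks of three valid samples with a gap in between, the model marks the first valid of each block as sop and assigns consecutive BSN values starting at the initial BSN.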
+The period of the sync interval can be:
+- Must be longer than the input - output latency within SDP, to ensure that a sync at any FPGA refers to the
+  same time instant. On the ring with N = 16 FPGAs and store and forward at every FPGA the latency will be
+  in the order of 2*N = 32 blocks, so about 32 * T_sub = 32 * 5.12 us = 0.16 ms
+- A period of 1 s [4.5.2.1]:
+  . fits the way in which humans count time
+  . aligns with the PPS grid
+  . is short enough to meet required periodic M&C rates
+  . is long enough for the SCU software to perform time critical M&C in time
+  . suits as integration period for the periodic statistics like ADC power, SST, BST and XST
+  . suits as update period for the BF weights
+- In LOFAR1 the sync interval was chosen to be aligned to the external PPS, so a period of 1 s. This resulted
+  in having 195313 T_sub for even PPS sync intervals and 195312 T_sub for odd PPS sync intervals.
+- In APERTIF the sync interval was chosen to be an integer number of T_sub, which resulted in 800000 T_sub
+  or a period of 1.024 s
+In LOFAR1 the Station BSN is divided into a 32b seconds sequence number (SSN) that counts PPS intervals and a
+32b local BSN that counts blocks within a sync interval. The MSbit is '1' to identify the SSN when
+the local BSN is 0. The MSbit is '0' to identify the local BSN. A 31 bit SSN can cover 2**31 * 1.024
+/ 24/3600/365.25 = 69 years. Since t_epoch = 1970 this extends to 2039, which could still be in the life time of
+LOFAR, although the requirement is 10 years.
+For LOFAR the BSN counts T_sub = 5.12 us periods. The sync interval is 1 second, so there are 195313 T_sub 
+periods in even sync intervals and 195312 T_sub periods in odd sync intervals.
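The alternation between 195313 and 195312 block periods per second can be reproduced by counting block starts per PPS interval. The sketch assumes block starts on a grid that is fixed at t_epoch with N_blk = 1024 and f_adc = 200 MHz; the function name is illustrative.

```python
# Count the block starts that fall within the second that begins at PPS number ssn.


def blocks_in_sync_interval(ssn, n_blk=1024, f_adc=200_000_000):
    """Number of block starts in the interval [ssn, ssn + 1) seconds."""
    first = -(-(ssn * f_adc) // n_blk)        # first block at or after this PPS
    nxt = -(-((ssn + 1) * f_adc) // n_blk)    # first block at or after next PPS
    return nxt - first
```

Even seconds contain 195313 block starts and odd seconds 195312, so two seconds together contain 390625 blocks, which is 2 s / 5.12 us.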
+The advantage of transporting 32b BSN is that the BSN is then typically small enough to fit in a data word. The
+data word may be the sosi.data or the concatenation of sosi.im and sosi.re.
+Distinguish between BSN within SDP and timestamp external to SDP. Internally in SDP the local BSN is sufficient,
+because the latency within SDP is only tens of T_sub. The local BSN = 0 marks the start of a sync interval,
+which is aligned to the PPS UTC seconds grid. At an external interface the IO component needs to maintain a 
+SSN that is initialized via M&C. This SSN belongs to the sync interval that starts when the local
+BSN = 0 arrives. In this way the SSN does not have to be transported within SDP, however it does have 
+to be initialized at each IO node that uses it. If the BSN only transports the local BSN then the SSN 
+can be defined as 32b unsigned to cover 2**32 * 1.024 / 24/3600/365.25 = 139 years. Since 1970 this extends to
+2109, so well beyond 2039.
+The SSN does not have to be transported along with the internal data stream. Instead like on RSP the
+SSN can be maintained centrally on each FPGA. The SSN is initialized via M&C. On RSP the M&C
+has to write the timestamp at each PPS interval. For SDP choose to let M&C only initialize the timestamp and
+have the firmware increment the timestamp = SSN at the PPS. The sosi.sync in a data stream captures
+the SSN, like on RSP. Due to the data path latency the sosi.sync in the data stream has a
+latency > 0, but <<< 1 s with respect to the PPS. The central SSN can be 32b to cover 139 years until 2109.
+What if the sosi.sync is lost? A lost sync is detected by the local BSN being smaller than the previous
+local BSN. This can also be used to capture the SSN, except for the end of the sosi.sync interval
+that covers the latency between PPS and sosi.sync. Alternatively, like in RSP, except that a sosi.sync may
+get lost and cause a timestamp to be wrong for that sync interval. This corrupts the entire sync interval,
+but the chance of a missed sosi.sync is small.
+UTC time = SSN * 1 s + local BSN * T_sub + mod(SSN * 1 s, T_sub)
+The SDP firmware can derive UTC time from SSN and local BSN. To avoid having to calculate mod() the
+M&C should with the initial BSN also provide mod(initial BSN * 1 s, T_sub) and the increment of
+mod(1 s, T_sub). The SSN and corresponding UTC time in T_sub units can be monitored via M&C.
+UTC time = init BSN * 1 s + mod(init BSN * 1 s, T_sub) +
+                  p * 1 s + mod(       p * 1 s, T_sub) +
+           local BSN * T_sub
+Use central SSN and transport local BSN. However at sync we could still transport the 31 LSbits of
+the SSN instead of the known 0 of the local BSN. This could be used for monitoring purposes and as
+a useful index in simulation.
+The CDO in LOFAR 1.0 sends a 32 bit SSN (called timestamp) to CEP and a local BSN. The local
+BSN is based on the 16 bit FSN for bits [15:0] and every time the FSN = -1 the local BSN is incremented
+by 2**16. This scheme allows using an FSN of only 16 bit, but relies on the block with FSN = -1 =
+2**16-1 not getting lost. Alternatively the check on FSN = -1 could be replaced by FSN < prev FSN. The
+FSN and local BSN both restart at the sync, so they also both rely on the block with the sync not
+getting lost.
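The alternative check FSN < prev FSN can be sketched as follows. This is a simplified model with a hypothetical function name: it extends a wrapping 16 bit FSN to a local BSN and does not model the restart at the sync.

```python
# Extend a wrapping FSN to a local BSN using the FSN < prev FSN check.


def extend_fsn(fsn_sequence, fsn_w=16):
    """Return the local BSN for each received FSN of fsn_w bits."""
    wrap = 1 << fsn_w
    upper = 0            # accumulated multiples of 2**fsn_w
    prev = None
    out = []
    for fsn in fsn_sequence:
        if prev is not None and fsn < prev:
            upper += wrap          # FSN wrapped since the previous block
        out.append(upper + fsn)
        prev = fsn
    return out
```

With this check the local BSN stays correct when the block with FSN = 2**16-1 is lost: `extend_fsn([65534, 0, 1])` still yields 65534, 65536, 65537. It fails only if 2**16 or more consecutive blocks are lost.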
+Can we decouple the FSN used for frame alignment from the BSN used as UTC timestamp? Important aspects:
+- UTC since 1 Jan 1970
+  . 2**64 / (365.25 * 24 * 3600 / 5 ns) = 2922 years
+- Start or restart processing at a PPS
+- UTC/PPS provide the common reference to SC and SDP
+- Timestamp unit:
+  . sample period              T_adc   = 5 ns                                    --> 200000000
+  . subband period             T_sub   = 1024 * T_adc = 5.12 us                  --> 195312.5 = 195312 + 1/2
+  . oversampled subband period T_osub  = T_sub / R_os = 5.12 / (32/27) = 4.32 us --> 231481.481 = 231481 + 13/27
+  . channel period             T_chan  = 128 * T_sub = 655.36 us                 --> 1525.87890625 = 1525 + 225/256
+  . oversampled channel period T_ochan = 128 * T_osub = 552.96 us                --> 1808.449074074074 = 1808 + 97/216
+  . integration period, PPS period
+  In general the periods do not fit an integer number of times in the 1 s PPS grid.
+- Subband block phase
+- Signal input samples buffer
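The number of periods per second listed for each candidate timestamp unit can be reproduced exactly with rational arithmetic. The sketch assumes T_adc = 5 ns and expresses each period as an integer number of T_adc; the function name is illustrative.

```python
# Number of periods per second as an integer part plus an exact fraction.
from fractions import Fraction


def periods_per_second(period_in_t_adc, t_adc_ns=5):
    """Return (whole, fraction) of the number of periods in 1 s."""
    n = Fraction(10**9) / (Fraction(period_in_t_adc) * t_adc_ns)
    whole = n.numerator // n.denominator
    return whole, n - whole
```

For example `periods_per_second(864)` returns 231481 and 13/27 for the oversampled subband period, and `periods_per_second(128 * 1024)` returns 1525 and 225/256 for the channel period, so only the sample period itself fits the 1 s grid.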
+Design decisions:
+- Use a pulse from the PPS to align ADCs via JESD204B
+- Use a pulse from the PPS to start the internal sync interval
+- Use a pulse from the PPS to start the subband processing, so start a 1 s UTC grid
+- Allow starting the subband processing at any PPS, the subband phase offset can be controlled using the
+  signal input samples buffer.
+- Use 1 s period of 200M (or 160M) cycles for the internal sync interval, because this fits the integration
+  time for the statistics and the update rate of the beamlet weights
+- The number of samples per sync interval of 200M or 160M must be programmable via M&C. The default is 200M,
+  but in simulation it can be much less.
+- Use central UTC timestamp at PPS initialized by M&C and incremented by SDP firmware for the SSN
+  per FPGA.
+- Use 32 bit SSN to fit UTC in seconds for 136 years since 1970
+- Use local BSN that counts data blocks within a sync interval, so it restarts at 0 at the internal sync
+- Within SDP transport the sync and the local BSN. The sync is transported via the MSbit of the local BSN.
+  At the sync transport the 31 bit SSN instead of local BSN 0, but only for monitoring purposes.
+- Derive 64 bit UTC timestamp in units of T_sub in SDP firmware and use this for data output to CEP
--- a/applications/lofar2/doc/prestudy/station2_semi_float32.txt
+++ b/applications/lofar2/doc/prestudy/station2_semi_float32.txt
+Semi-floating point values for SST, XST and BST
+int2float.vhd uses a 32bit semi-float with 1 bit exponent and 31 bit mantissa
+- float_w = 32
+- int_w = 54
+- exp_w = 1
+- mantissa_w = float_w - exp_w = 31
+- mantissa = 2**31
+- base_w = int_w - mantissa_w = 23
+- base = 2**base_w = 2**23
+- if int in range -mantissa/2 to +mantissa/2-1 then exp = 0 and float = int
+                                               else exp = 1 and float = round(int / base)
+- decode: if exp = 0 --> int = mantissa field value
+- decode: if exp = 1 --> int = mantissa field value * base
+53 52 51 50 49 48 47 46 45 ... 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 ... 4  3  2  1  0
+<s--s--s--s--s--s--s--s--s-...--s--s><s---------mantissa-----------------------...-------------->
+53 52 51 50 49 48 47 46 45 ... 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 ... 4  3  2  1  0
+<s------mantissa-----------...------------------------------><------2**23------...--------------> 
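The semi-float conversion listed above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the int2float.vhd implementation: round() is taken to round halves towards +infinity, the saturation at the top of the 54 bit range is an assumption because the text does not say how that case is handled, and the result is returned as an (exp, mantissa) pair because the position of the exponent bit in the 32 bit word is not given.

```python
INT_W = 54
FLOAT_W = 32
EXP_W = 1
MANTISSA_W = FLOAT_W - EXP_W             # 31
BASE_W = INT_W - MANTISSA_W              # 23
MANT_MIN = -(1 << (MANTISSA_W - 1))      # -mantissa/2
MANT_MAX = (1 << (MANTISSA_W - 1)) - 1   # +mantissa/2 - 1


def int2float(value):
    """Convert a 54 bit signed int into (exp, mantissa)."""
    if MANT_MIN <= value <= MANT_MAX:
        return 0, value                  # fits as is, no rounding
    # round(int / base), halves rounded towards +infinity
    mant = (value + (1 << (BASE_W - 1))) >> BASE_W
    # Assumption: saturate if rounding up overflows the 31 bit mantissa
    return 1, min(mant, MANT_MAX)


def float2int(exp, mant):
    """Convert (exp, mantissa) back into an int."""
    return mant << BASE_W if exp else mant
```

For example int2float(2**30) returns (1, 128), and float2int(1, 128) returns 2**30 again. Values that are not a multiple of 2**23 come back with an error of at most half a base step.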
+The SNR decrease due to quantization is:
+  (s/s0)^2 = 1 + 1/(12*g^2)
+where s = sigma, s0 = sigma of sky noise, d = quantization step, g = s0 / d.
+After integration over N powers the SNR improves by a factor sqrt(N), so the integrated power S of s needs
+log2(sqrt(N)) extra bits to contain the processing gain of the incoherent integration, plus twice as many
+bits as s to contain the power s^2. The sqrt(N) processing gain for integrated powers is
+also derived in MEM-131 of ARTS by SJW:
+  (s/s0)^2 = 1/N + 1/(12*g^2)
+For N = 1 there is no integration, so then the SNR decrease is defined by the input quantization. For N > 1
+the integration reduces the noise term in (s/s0)^2 by a factor N, which is a factor sqrt(N) in sigma, so the
+quantization needs to become finer as well.
+The power value of s^2 needs twice as many bits, to contain the whole range, but only half of these are
+significant, because squaring a number does not add information. Hence for sigma s with 4 bits including
+sign bit and N = 195312.5 so log2(sqrt(195312.5)) = 8.8 bits this results in at least 4 + 8.8 = 12.8 bits.
+The power values use twice as many bits, so 2 * 12.8 - 1 sign = 24.6 bits or about 25 bits to represent
+the SST and XST power values. For the BST the BF also provides a processing gain for incoherent noise
+of log2(sqrt(96)) = 3.3 bits, so the BST needs about 12.8 + 3.3 = 16.1 bits. The power values use twice as
+many bits, so 2 * 16.1 - 1 sign = 31.2 bits or about 31 bits to represent the BST power values.
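The bit budget above can be reproduced with a short calculation. The function name power_bits and its parameters are only illustrative; the numbers 4, 195312.5 and 96 are the ones used in the text.

```python
import math


def power_bits(sigma_bits, n_int, n_bf=1):
    """Bits needed for an integrated power value.

    sigma_bits: bits for sigma s, including the sign bit
    n_int:      number of powers that are integrated
    n_bf:       number of inputs summed by the beamformer (1 = no BF)
    """
    gain_bits = math.log2(math.sqrt(n_int)) + math.log2(math.sqrt(n_bf))
    voltage_bits = sigma_bits + gain_bits
    # Power needs twice as many bits, minus one of the two sign bits
    return 2 * voltage_bits - 1


sst_xst_bits = power_bits(4, 195312.5)   # SST and XST
bst_bits = power_bits(4, 195312.5, 96)   # BST, with BF processing gain

print(round(sst_xst_bits, 1), round(bst_bits, 1))  # 24.6 31.2
```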
+The semi-floating point value of 32 bit with 1 bit exponent can represent 31 bit values without extra
+rounding, which is suitable to represent the SST, XST and BST values for measurements without RFI or
+with weak RFI. The exponent = 1 representation is suitable and needed to represent the strong RFI signals.
+The range of the semi-floating point values for the powers can be increased by first rounding the powers
+by up to 4 + 8.8 = 12.8 LSbits, because these power bits are not significant. A safe value would be to
+round e.g. 8 LSbits of the power int values before converting them to semi-floating point. The calculated
+power values then become 10*log10(2**8) = 24 dB lower.
+Note:
+- The ADC sign bit is also a bit that counts in the SNR as 6 dB. The sigma value is positive by definition.
+  For (s/s0)^2 = 1.01, so 1 % worse SNR due to quantization, g = 3.53 and s0 = 3.53 d. Hence s0 is log2(3.53)
+  = 1.8 bit including sign bit (1.8b = 10.8 dB). For Gaussian noise the -3 to +3 sigma range contains 99.7 %
+  of the values. The 3 sigma corresponds to log2(3) = 1.6 bit. In total the ADC 3 sigma input then fits in
+  3.4 bit, so use 4 bit as a practical lower limit for ADC input quantization with negligible quantization
+  loss.
+- Rounding a large value A can turn +A and -A into +B and -B+1 when rounding is done towards +infinity
+  (which is what round() does). In the LOFAR 1.0 subband correlator this caused confusion,
+  because both X*Y and Y*X were calculated, which should satisfy X*Y = conj(Y*X), but due to the rounding
+  effect they could differ by 1. For LOFAR 2.0 this can occur again for the cross correlation of the local
+  signal inputs per PN. The cross correlations with the remote signal inputs are calculated only once, so
+  there the rounding effect is not noticed. To avoid this difference common_int2float.vhd in LOFAR 1.0
+  can use g_symmetric=TRUE, which applies truncation towards zero.
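The rounding asymmetry in the last note can be shown with a small example. This is a sketch, not the VHDL: round_half_up models rounding with halves going towards +infinity, and truncate_to_zero models what the text describes for g_symmetric=TRUE. Both function names are made up for this illustration.

```python
def round_half_up(value, w):
    """Remove w LSbits, rounding halves towards +infinity."""
    return (value + (1 << (w - 1))) >> w


def truncate_to_zero(value, w):
    """Remove w LSbits of the magnitude, so +A and -A stay symmetric."""
    magnitude = abs(value) >> w
    return magnitude if value >= 0 else -magnitude


W = 23
a = (5 << W) + (1 << (W - 1))   # exactly halfway between 5 and 6 base steps

print(round_half_up(a, W), round_half_up(-a, W))        # 6 -5
print(truncate_to_zero(a, W), truncate_to_zero(-a, W))  # 5 -5
```

With rounding towards +infinity the pair +A, -A becomes +6, -5, so +B and -B+1, while truncation towards zero gives the symmetric pair +5, -5.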
--- a/applications/lofar2/doc/prestudy/station2_to_do_erko.txt
+++ b/applications/lofar2/doc/prestudy/station2_to_do_erko.txt