Commit a50b699e authored by Pieter Donker

Merge branch 'master' of git.astron.nl:desp/hdl

parents 2a32f406 14fe61d5
3343 additions and 0 deletions
File added
Idea / rule: Distinguish between state registers and pipeline registers.
. The state registers keep the state of the function and the function itself is programmed in combinatorial logic.
In this way the pipelining that is needed to achieve timing closure can be added independently of the function.
This approach could be described in a paper, because it is quite significant and differs from the well known
Gaisler approach (that uses RL=1 and does not separate state from pipeline). AXI uses RL=0, but we need to check
how it then handles pipelining.
. Components need pipelining to achieve timing closure. This pipelining causes a latency in the data
stream. This latency is typically no problem, because it only delays the output. If components need
flow control then the stream has a siso backpressure signal that must have a certain timing relation
to the sosi data signal. This timing relation is the ready latency (RL) and the RL can be >= 0. For
RL = 0 the ready signal acts as a data acknowledge and for RL > 0 the ready signal acts as a data
request signal. Adding pipelining to the sosi data increases the RL.
. The RL is explained in the Avalon specification. An example of RL = 0 are so called look ahead (Altera)
or first word fall through (Xilinx) FIFOs. In our UniBoard applications we use RL = 1. For most parts
of the design we try to not use flow control. I think that AXI stream uses RL = 0.
. The function operates with ready latency (RL) = 0, if it is combinatorial. If the stream has no flow
control then the pipeline is achieved as an output register stage. If the stream does need flow control,
then this output register stage increases the RL by 1. To restore the RL to 0 a dp_latency_adapter.vhd
is needed. This latency adapter also registers the ready, so it provides pipelining for both the output
stream sosi data as well as the output stream siso ready flow control.
. For new components the development approach is to implement the function for RL = 0, so only with the state
registers. If the component does not use flow control, then it may still just wire the flow control
from output to input. If the component does use flow control then it can combinatorially impose this
on the incoming flow control and pass the combined flow control on to its input. For timing closure
the pipelining is added as a separate stage. Either pipeline sosi if no flow control is needed
or pipeline siso if flow control is needed. For example: dp_block_resize.vhd, dp_counter.vhd.
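The timing relation between siso ready and sosi valid described above can be sketched as a small behavioural model. This is only an illustration in Python, not the VHDL of the dp components; the function names ready_cycles and add_pipeline_stage are hypothetical, and the model assumes the Avalon style definition that a ready at cycle t allows valid data at cycle t + RL.

```python
def ready_cycles(ready, rl):
    """Return per-cycle flags telling when the source may assert valid.

    Cycle t is a ready cycle when ready was asserted at cycle t - rl.
    Cycles before rl have no earlier ready to refer to, so in this
    simplified model they are not ready cycles.
    """
    return [t >= rl and bool(ready[t - rl]) for t in range(len(ready))]


def add_pipeline_stage(rl):
    """One register stage in the sosi data path increases the RL by 1."""
    return rl + 1


# Example siso ready pattern over six clock cycles (assumed values).
ready = [1, 1, 0, 1, 0, 0]

# RL = 0: ready acts as acknowledge, valid is accepted in the same cycle.
rl0 = ready_cycles(ready, 0)

# RL = 1 after adding an output register: ready acts as request for the
# next cycle.
rl1 = ready_cycles(ready, add_pipeline_stage(0))

print(rl0)
print(rl1)
```

With RL = 1 the pattern of allowed valid cycles is the ready pattern delayed by one cycle, which is what a latency adapter has to undo to restore RL = 0.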
Ref:
$RADIOHDL/tools/oneclick/doc/desp_firmware_dag_erko.txt
$RADIOHDL/tools/oneclick/doc/desp_firmware_overview.txt
\ No newline at end of file
1) Introduction
a) Focus on checking UniBoard2 based solution, because:
- we need an FPGA to interface the ADC
- using an existing board saves development time.
b) Assumptions:
- the subband filterbank will be implemented on the FPGA, because we need the FPGA anyway so then it can
also do some DSP
- the station beamformer is implemented on the FPGA because for a LOFAR station the number of beams is
small, so this yields a larger data rate reduction to the subsequent processing (on CPU / GPU)
- use a ring beamformer (like in Lofar 1.0), because it avoids having to use a large mesh (like in Apertif
BF) and because the ring can easily be extended with more nodes if necessary (like for the international
Lofar 1.0 stations).
- Starting with critically sampled filterbank saves time, in design allow for oversampled filterbank
c) New compared to Lofar 1.0
- Same analogue band width and sample frequency ranges, but in total 4x more RCU input:
. 3 times more input due to simultaneous 2x LBA + 1x HBA
. ready for 4 times more input to support another 1x HBA input for Lofar Space Weather
- Ready for output to Aartfaac 2.0
d) Other relevant aspects
- System requirements must be clear and complete at PDR, otherwise the project will be delayed due to lack of clarity
- We used to have delay and therefore 'conflict' at end of project due to reactive, passive planning, now
we need 'conflict' at start of project to be proactive and meet the end date.
- X and Y and LBA and HBA can be implemented on independent hardware because they are processed
independently. The firmware can be the same (apart from different parameter settings). Therefore
best use separate RCU for LBA and HBA, instead of combining LBA + HBA on one RCU, because otherwise they
will also need to be processed together in a subrack (assuming that the serial ADC link goes via a
backplane and not via a fiber.)
- the current HBA uses DC power via one coax (x-pol) and control via the other coax (y-pol). The control
uses a proprietary control protocol based on Manchester encoding and implemented using a PIC micro
controller. The PIC microcontroller is an I2C slave.
- all input should also be available at output of the FPGA to be future proof, this is a lesson learned
from Lofar 1.0 where only with high effort still only a small band could be made available for Aartfaac
- during a life time of 10 years FPGAs remain available, GPU will require an upgrade to a new version
e) Development time:
- Starting with critically sampled filterbank saves time, in design allow for oversampled filterbank
- With a critically sampled filterbank like in Lofar 1.0 the new Lofar 2.0 station can operate together with
a Lofar 1.0 station
- Much reuse from Apertif and RSP firmware
- Some new aspects:
. oversampled filterbank (oversample factor increases the output load)
. JESD serial ADC data interface
. how to connect RCU I2C control interface (via microprocessor with 1GbE on PAC)
. TBB function on UniBoard2
. separate MM clock domain and sample clock domains (160, 200 MHz)
. reuse M&C protocol from Gemini instead of UniBoard Control Protocol
. station correlator via TBB function or via crosslet statistics (similar as in RSP, Apertif PAF
correlator)
- Detailed design must include M&C and test
2) Oversampled filterbank:
- See dupllo_oversampled_subband_filterbank.txt.
3) TBB memory
a) Lofar 1.0 (96 RCU for core and remote, 192 for international)
There is 1 TBB / 2 RSP, so 1 TBB / 16 RCU --> so 32 GByte/ 16 RCU = 2 GByte / RCU.
With 200 MHz and assume 2 byte per sample this corresponds to 2 GByte / 0.2 GHz / 2 byte = 5 sec.
b) 6 UniBoard1 (288 RCU)
The largest DDR3 SODIMM that can fit on UniBoard is 16 GByte and each PN on UniBoard can have two DDR3
SODIMMs. With 6 UniBoard1 there are 6 * 8 * 2 * 16 GByte = 1536 Gbyte for 288 RCU = 5.3 GByte / RCU.
With 200 MHz and assume 2 byte per sample this corresponds to 5.3 GByte / 0.2 GHz / 2 byte = 13.3 sec.
Uses 6 * 8 * 2 = 96 DDR3 SODIMMs. DDR3 can achieve 1.6 GTps @ 200 MHz.
c) 3 or 4 UniBoard2 (288 RCU for 2x LBA + HBA, 384 including also 1x HBA for Space Weather)
The largest DDR4 SODIMM that can fit on UniBoard2 is 36 GByte and each PN2 can have two DDR4 SODIMMs.
With 3 UniBoard2 there are 3 * 4 = 12 PN, so in total 12 * 2 * 36 Gbyte = 864 Gbyte for 288 RCU = 3
GByte / RCU. With 200 MHz and assume 2 byte per sample this corresponds to 3 GByte / 0.2 GHz / 2 byte
= 7.5 sec. Uses 12 * 2 = 24 DDR4 SODIMMs. The required write rate per SODIMM is 288 RCU * 16b * 200 MHz
/ 24 SODIMMs = 38.4 Gbps. The data width of the SODIMM is 64b (or 72b) so this is 38.4 Gbps / 64b =
0.6 GTps, which is easily feasible, because DDR4 can achieve 3.2 GTps @ 400 MHz (transfers per second).
==> 1 UniBoard2 per 96 RCU can buffer 1.5 times more transient data than Lofar 1.0. Possibly use a factor 2 fewer
DDR4 modules or use smaller DDR4 modules.
d) UniBoard2 + external TBB storage cluster:
Perhaps the TBB function can be implemented on an external storage cluster, because UniBoard2 can output
all input. The total data rate is 288 * 200M * 16b = 288 * 3.2 Gbps = 921 Gbps, so with 2 ADC / 10GbE
link this requires 144 10GbE links. For 10 s the cluster needs about 1152 GByte = 72 * 16 GByte DDR
modules.
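The TBB memory budget of options a) to d) can be recomputed with a short script. This is a sketch that only uses the numbers from the text above (200 MHz sample rate, 2 byte per sample, 288 RCU); the function and variable names are made up for this illustration.

```python
def buffer_seconds(gbyte_per_rcu, f_adc_mhz=200, bytes_per_sample=2):
    """Buffer depth in seconds for a given memory size per RCU."""
    mbyte_per_second = f_adc_mhz * bytes_per_sample
    return gbyte_per_rcu * 1000 / mbyte_per_second


# a) Lofar 1.0: 32 GByte per TBB, 16 RCU per TBB
lofar1_s = buffer_seconds(32 / 16)

# b) 6 UniBoard1: 8 PN per board, 2 SODIMM per PN, 16 GByte per SODIMM
unb1_gbyte = 6 * 8 * 2 * 16
unb1_s = buffer_seconds(unb1_gbyte / 288)

# c) 3 UniBoard2: 4 PN per board, 2 SODIMM per PN, 36 GByte per SODIMM
unb2_gbyte = 3 * 4 * 2 * 36
unb2_s = buffer_seconds(unb2_gbyte / 288)
unb2_write_mbps = 288 * 16 * 200 / 24          # write rate per SODIMM
unb2_mtps = unb2_write_mbps / 64               # transfers/s on a 64b bus

# d) external cluster: total input rate and storage for 10 s
total_mbps = 288 * 16 * 200
cluster_gbyte = total_mbps * 10 / 8 / 1000

print(lofar1_s, unb1_s, unb2_s)
print(unb2_write_mbps / 1000, "Gbps per SODIMM,", unb2_mtps / 1000, "GTps")
print(total_mbps / 1000, "Gbps total,", cluster_gbyte, "GByte for 10 s")
```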
4) FPGA resource usage
See Station ADD section 4.5.2.9
[1] "HBA Control Design Description", LOFAR-ASTRON-MEM-175, apr 2010, E. Kooistra
[2] "RSP Firmware Design Description", LOFAR-ASTRON-SDD-018, sep 2013, E. Kooistra
\ No newline at end of file
Oversampled filterbank:
1) Purpose
- to measure line spectra in the channels at the edges of a subband, could the AAF for Apertif be an alternative?
- to use a synthesis filterbank on the beamformed data, why reconstruct the time series ?
2) Working of analysis oversampled filterbank
PFB = PFIR -> FFT
The polyphase filterbank (PFB) consists of a FIR prefilter (PFIR) and an FFT. The downsample factor is set by the FFT block size N_fft. For computational efficiency N_fft needs to be a power of 2, but a factor 3 or 5 may be included too. The PFIR section has N_fft phases and N_tap taps per phase. The coefficients follow from a low pass prototype FIR filter, as a snake pattern for all taps, for all points. In a critically sampled PFB the input data is shifted in in blocks of size N_fft. In an oversampled PFB the data is shifted in in blocks of size M with M < N_fft, so r = N_fft/M is the oversample factor. The shift of less than N_fft causes a phase step between blocks in the PFB output. This phase step can be compensated by counter rotating the data that is input to the FFT [harris, tuthil].
The oversampling N_fft / M also implies that multiple PFBs in parallel need to keep aligned not only the N_fft blocks, but also the oversampling sub blocks of size M. In ASKAP, r = 32/27 with 1 MHz subbands means that an integer number of fine channel periods takes 27 seconds, so causing a periodicity at large time scales to align at the human (and VDIF) 1 sec grid.
0 f_s/2
|-.-|---|..............|---|
.
|-.-| f_sub/2
.
<-.-> N_chan
.
|--.--| f'_sub/2
.
<--.--> N'_chan
For the critically sampled PFB the downsampled frequency per subband is f_sub = f_s / N_fft. In case of a real input there are N_sub = N_fft / 2 subbands, where the factor 2 arises because for a real input the positive and negative frequency spectra are complex conjugates, so only half of the subbands are unique.
In the PFB this results in that each downsampled subband is centred around 0 Hz with subband sample frequency f_sub and complex subband samples. Hence for a complex signal the Nyquist sample rate is equal to the bandwidth, so the Nyquist factor 2 then appears in the fact that the signal is complex, so with 2 values (real and imaginary) per sample.
The subband bandwidth B_sub is determined by the PFIR and independent of the subband rate f_sub, so B_sub <= f_sub. The f_sub = f_s / N_fft defines the frequency grid. The f'_sub > f_sub makes it possible to oversample B_sub and to have B_sub = f_sub without aliasing. For the oversampled filterbank the f'_sub = r * f_sub. The subband bandwidth B_sub can be selected such that it is still almost flat up to f_sub and then drops down to the stop band level at f'_sub. The width of the transition region is set by r. ASKAP and SKA LFAA use r = 32/27 ~= 1.185. For two neighbour subbands the transition region to attenuate the aliasing is 2*(r-1)*f_sub. A larger oversampling factor r eases the PFIR filter for a required aliasing attenuation, but increases the data rate.
Oversampling does not change the frequency grid of the PFB, because the frequency grid is set by the FFT size. The oversampling only increases the sample rate per frequency bin (subband or channel) and this can be used to achieve more attenuation between neighbouring bins (subband or channel) to eliminate aliasing.
---- ---- ^
\ / .
\/ .
/\ .
/ \ .
/ \ v
<-> aliasing attenuation
f'_sub
f_sub
The subbands (coarse channels) are again separated into smaller bandwidth channels (fine channels). The number of channels in f'_sub is N'_chan, so f'_chan = f'_sub / N'_chan. If f_sub = K * f'_chan then K * N_sub channels from the oversampled subbands provide a continuous flat spectrum, without aliasing between subbands. The N'_chan - K channels in transition regions are dropped. The FFT size of the channel PFB is equal to the number of channels N'_chan, because the channel PFB has complex subband input.
Define r = p/q = N_fft/M where p and q are the smallest integers to represent r.
f_sub = f'_sub/r = N'_chan * f'_chan / r = K * f'_chan
--> K = N'_chan / r = N'_chan * q / p
Hence to fit the integer constraint for K both N_fft and N'_chan must be integer divisible by p. The q is free to choose, but must be integer and <= p.
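The relation K = N'_chan * q / p and its integer constraint can be checked with exact fractions. This is a minimal sketch; the function name channels_per_subband is hypothetical and the example values are the ones used elsewhere in these notes.

```python
from fractions import Fraction


def channels_per_subband(n_chan_os, r):
    """Return K = N'_chan / r, the number of channels kept per f_sub.

    Raises ValueError when K is not an integer, so when N'_chan is not
    divisible by p in r = p/q.
    """
    k = n_chan_os / Fraction(r)
    if k.denominator != 1:
        raise ValueError(f"N'_chan = {n_chan_os} does not fit r = {r}")
    return int(k)


# r = 32/27 with N'_chan = 64 channels per oversampled subband
print(channels_per_subband(64, Fraction(32, 27)))

# r = 2 with N'_chan = 128 channels per oversampled subband
print(channels_per_subband(128, Fraction(2, 1)))
```

A call such as channels_per_subband(16, Fraction(32, 27)) raises ValueError, because 16 is not divisible by p = 32.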
Beamforming is done per subband sample from S_ant inputs. The result is a beamlet, which can be regarded as a subband with direction. A subband may be used for multiple beam directions, so it results in a beamlet for each direction. For the subband and beamlet samples the data rate is a factor r higher; it is only after a channel PFB that the channels in the transition band can be dropped.
3) Compatibility with LOFAR 1.0
In LOFAR 1.0 the subband PFB F_sub has N_fft = 1024, so N_sub = 512. The channel PFB F_chan has N_chan = 16, 64 or 256 channels. The 16 channels are used for pulsar timing (PST). In LOFAR 1.0 both F_sub and F_chan are critically sampled. Using r = p / q = 32 / 27 for LOFAR 1.0 with 64 channels fits and yields a spectrum with 54 channels per f_sub, so the channel width then increases by the oversample factor.
To achieve the same width as for LOFAR 1.0 requires using r = 2 and N'_chan = 128, because r = p/q = 2/1 then yields N_chan = 64 channels per f_sub. Compared to a LOFAR 1.0 channel the phase slope over the channels from an oversampled F_sub will be a factor r less, due to that f'_sub = r * f_sub.
I do not think it is possible to support LOFAR 1.0 channel width with an oversampled F_sub for r < 2. Also not with an oversampled channel PFB, because oversampling does not change the channel frequency grid. Using r = 2 does fit the existing LOFAR 1.0 frequency grid, but will cause a factor r = 2 higher output rate to CEP, because the data rate can only be reduced again after the channel filter. Therefore a solution can be to move the fine channel filter from CEP to the stations.
4) Required oversampling factor
The required oversampling factor depends on the stop band attenuation and stop band bandwidth, and is a trade-off between data rate and processing load. The N_fft = 1024 is a power of 2, so p in r = p/q also has to be a power of two, e.g.:
32/28 = 8/7   ~= 1.143
32/27         ~= 1.185 <-- used by ASKAP, LFAA
32/26 = 16/13 ~= 1.231
32/25         ~= 1.280
32/24 = 4/3   ~= 1.333
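The candidate factors in this list can be generated and checked against N_fft = 1024. The sketch below assumes only the definition r = N_fft / M from section 2; the function name candidates is made up for this illustration.

```python
from fractions import Fraction

N_FFT = 1024


def candidates(q_values=(28, 27, 26, 25, 24)):
    """Return (r, M, p_divides_n_fft) for each r = 32/q.

    r is reduced to p/q with smallest integers, M = N_fft / r is the
    input block shift, and p must divide N_fft for M to be an integer.
    """
    result = []
    for q in q_values:
        r = Fraction(32, q)          # Fraction reduces to smallest p/q
        m = N_FFT / r                # block shift M, exact fraction
        result.append((r, m, N_FFT % r.numerator == 0))
    return result


for r, m, fits in candidates():
    print(f"r = {r} ~= {float(r):.3f}, M = {m}, p divides N_fft: {fits}")
```

A larger r gives a smaller block shift M, so the subband data rate increases by the same factor r.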
5) Working of synthesis oversampled filterbank
Reconstruction from f'_sub (beamlets) or from f'_chan
Why reconstruct to time series, to separate to new channels?
Reconstruct the whole band or only a part of the band e.g. 16 MHz for VLBI?
OPC-UA is the IEC 62541 standard
Large, open, platform independent standard, but if only a subset of the features is supported, then it becomes less
standard or platform independent.
OPC classic = Object Linking and Embedding (OLE) for Process Control
OPC = Open Platform Communications.
OPC-UA = OPC Unified Architecture
https://opcfoundation.org/
http://wiki.opcfoundation.org/index.php/UA_Overview
https://en.wikipedia.org/wiki/OPC_Unified_Architecture
- Service oriented architecture (SOA) using asynchronous request/response pattern
- transport: via TCP in binary or web based
- data model: more than hierarchy of files/folder/registers, object oriented nodes that can send meta information and data
- expandability via profiles:
. DI = device integration
. DA = data access
. A&C = alarms and conditions
. HDA = historical data access
- security
- authentication
Needed:
- OPC-UA SDK (software development kit)
. Considerations regarding Software Development Kits for OPC-UA:
http://www.ascolab.com/images/stories/ascolab/doc/ua_whitepaper_implementation_e.pdf
- UA server requires at least ~200 kByte RAM
. https://documentation.unified-automation.com/uasdkhp/1.0.0/html/index.html
- TCP/IP stack
. NicheStack (free via Intel)
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/tt/tt_nios2_tcpip.pdf
. https://www.micrium.com/rtos/tcpip/
. Lightweight IP for UDP about 5.1 Mbps transmit and 3.4 Mbps receive using NiosII/f at 50 MHz, so less for TCP
https://www.ee.ryerson.ca/~courses/coe718/Data-Sheets/RTOS/tt_nios2_lwip_tutorial.pdf
- RTOS (realtime operating system)
. MicroC/OS-II: https://www.micrium.com/ (needed with NicheStack, requires license from Micrium)
*******************************************************************************
* Beamformer
*******************************************************************************
M&C:
* BF weights per PN:
- N_pol * S_pn * S_sub_bf * W_bf_weight * N_complex / W_byte = 2 * 12 * 488 * 16 * 2 / 8 = 46848 byte, which fits in 48 kByte
- N_pol * S_pn = 2 * 12 = 24 times S_sub_bf = 488 complex weights
- These weights can be sent in 24 packets with 1952 octets/packet
- Arria10 has 2713 BRAM of M20k = 2 kByte, so the BF weights use 24 BRAM
- BF weight memory options
. single buffer -->
- BF weights are applied immediately when written,
- SCU can send BF weights at arbitrary intervals,
- SCU must send BF weights at the time for which they were calculated,
- BF weights update rate must be high enough such that they change smoothly
. double buffer switch at PPS
- BF weights are applied at next PPS,
- SCU must send BF weights in the preceding second
. double buffer switch at BSN timestamp
- BF weights are applied at scheduled timestamp or immediately if the timestamp is in the past,
- SCU can send BF weights at arbitrary intervals,
- SCU can send BF weights in advance within the current update interval.
- Subband weights and BF weights design decision:
. General Jones matrix operation:
|wx cx|   |x|   |wx*x + cx*y|
|cy wy| * |y| = |wy*y + cy*x|
. Requirement [LOFAR2-3098] states that Station beams have to be independent per polarization. Therefore
wx /= wy allows making independent X and Y beams. Otherwise wx, wy could have had the same value, because X
and Y are at same location and subband calibration is done separately.
. cx, cy can be 0 because no polarization correction per element is needed:
|wx 0| |x| |wx*x|
| 0 wy| * |y| = |wy*y|
. Using cx = wx and cy = wy and wx /= wy allows making two independent unpolarized beams using all antenna elements:
|wx 0| |1 1| |x| |wx wx| |x| |wx * (x + y)|
| 0 wy| * |1 1| * |y| = |wy wy| * |y| = |wy * (x + y)|
The (x+y) could be implemented as first (x+y) and then *w, or as first weight and then add.
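The two weight matrix cases above can be verified numerically. This is a small sketch in plain Python with complex numbers; the helper apply_jones and the example weight and sample values are arbitrary assumptions for the illustration, not values from the design.

```python
def apply_jones(m, v):
    """Multiply a 2x2 matrix m = ((a, b), (c, d)) with a vector v = (x, y)."""
    (a, b), (c, d) = m
    x, y = v
    return (a * x + b * y, c * x + d * y)


# Arbitrary example weights and one dual polarization sample.
wx, wy = 0.5 + 0.5j, 2 - 1j
x, y = 1 + 2j, 3 - 1j

# Independent X and Y beams: cx = cy = 0.
independent = apply_jones(((wx, 0), (0, wy)), (x, y))

# Two unpolarized beams: cx = wx and cy = wy, so first weight and then add.
weight_then_add = apply_jones(((wx, wx), (wy, wy)), (x, y))

# Same beams, but first add x + y and then weight.
add_then_weight = (wx * (x + y), wy * (x + y))

print(independent)
print(weight_then_add)
print(add_then_weight)
```

Both orderings give the same beams; adding first needs one complex multiplication per beam instead of two, at the cost of one extra bit of word growth before the multiplier.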
*******************************************************************************
* Subband correlator
*******************************************************************************
*******************************************************************************
* Transient buffer
*******************************************************************************
*******************************************************************************
* Transient detection
*******************************************************************************
*******************************************************************************
* Subband offload
*******************************************************************************
\ No newline at end of file
*******************************************************************************
* Detailed Design Document of the LOFAR 2.0 Station SDP firmware
*******************************************************************************
? Link with functions in ADD
? Link with L4 requirements on SDP
? Link with ICDs (what is described in ICD and what in this DD):
* L2-ICD 11207 : RCU2S-SDP (JESD204B)
* L2-ICD 11209 : STF-SDP (SYSCLK / SYSREF and 200MHz / PPS)
* L2-ICD 11211 : SC-SDP (1GbE, Gemini M&C, create MM registers ICD from YAML files with ARGS)
* L2-ICD 11218 : SDP-STCA (no firmware interface)
* L1-ICD 11109 : STAT.SDP-CEP (beamlets, transient data read out)
* L1-ICD 11108 : STAT.SDP-NW (PHY, ARP, ping, XON/XOFF pause frames, no DHCP)
? Oversampled subband filterbank first needs modelling design
Title: Detailed Design of the LOFAR 2.0 Station Digital Processing (SDP) Firmware
Table of contents
References
Terminology
Definitions
Introduction
- Context
. ADD fig 3.1-1 (E)ICD and L3 PBS overview
- Scope
- Document overview
Station overview
. ADD fig 4.1.1-1 M&C SCU -- PCC -- Unb2
. ADD fig 4.5.1.2-1 UniBoard2 with 4 PN
. ADD fig 4.5.2-1 Firmware toplevel with ICDs
. ADD fig 4.5.2-2 External FPGA interfaces for M&C and data offload
Hardware architecture (SDP, STCA)
. Two UniBoard2 per subrack, one PCC, 32 RCU each with 3 signal inputs (ADCs)
. 12 ADC per FPGA, 48 ADC per UniBoard, 96 ADC per subrack
. LBA ring : two subracks
. HBA ring : one subrack for core (two sub-arrays, but one ring to have subband correlations for all)
one subrack for remote
two subracks for international
Firmware infrastructure
. BSP (unb2_minimal_gmi)
- Clock, reset, PPS, flash, fpga regmap info from YAML
- MM bus and ARGS
- Gemini M&C protocol (impact of AXI MM and ST)
. FPGA interface test designs
- M&C using 1GbE (unb2_minimal_gmi)
- ADC using JESD204B (unb2_test_adc)
- QSFP using 10GbE (unb2_test_qsfp)
- Ring using 10GbE (unb2_test_ring)
- DDR4 (unb2_test_ddr4)
. Board test design
- All interfaces (unb2_pinning, unb2_test)
. Clock domains
- ~50 -100 MHz M&C
- 200M ADC, 160M ADC
- > 200 MHz for processing to fit S_sub_bf = 488 or even 512?, and to prepare for R_os ~=1.25, f_max Arria10?
- transceivers, DDR4
. Firmware development
- RadioHDL
- Revisions
- Technology wrappers, component libraries and application libraries
- M&C software
- Coding style (constants package derived from parameters in doc)
Firmware architecture
. Application overview (array notation of interfaces and packets, ...)
- ADC ingress and time stamp
- Subband filterbank (critically sampled)
- Subband filterbank (oversampled)
- Beamformer
- Subband correlator
- Transient buffer (DDR4 interface, subband select and DM >= 0, packet format, M&C, RW access via M&C)
- Transient detection
- Subband offload
. Timing (how it is used, sync interval, PPS event, BSN scheduler)
. Quantization (where and how)
. Resource usage
. Debug, test and monitoring points (test functionality)
- BSN monitor
- Latency monitor
- FIFO fill monitor
- 1GbE, 10GbE statistics
- DDR4 CRC error counts
- Data buffer at signal input, beamlet output
Prototyping:
- FPGA - ADC JESD204B links (test board with Unb2b, one to S_pn = 12 inputs coax splitter)
- FPGA - PC 10GbE link stress tests (pause frames, ARP, data rate)
Designs:
- unb2c_minimal_gmi
References:
- Preliminary design txt files:
. station2_sdp_m_and_c.txt : Monitoring and control, Gemini protocol
. station2_sdp_timing.txt : Station BSN, timestamp definition, BSN aligner
. station2_sdp_ring.txt : ring access, packets for beamlets, crosslets, subbands, TB readout
. station2_sdp_dsp.txt : beamformer, subband correlator, transient buffer, transient detection, subband offload
. station2_sdp_icd.txt : ICD
. station2_sdp_hdl_components.txt : rework existing HDL components for LOFAR2.0
. station2_sdp_hdl_article.txt : reference article on RTL design using RL = 0, state and pipelining, AXI4 streaming
- Other:
. tools/oneclick/doc/desp_firmware_dag_erko.txt
. tools/oneclick/doc/desp_firmware_overview.txt
\ No newline at end of file
*******************************************************************************
* SDP Firmware planning
*******************************************************************************
Includes design, implementation, verification on HW, technical commissioning.
v1 v2
Infrastructure
10 20 - Development environment using GIT, RadioHDL, updating existing components
20 . - BSP using Gemini Protocol, ARGS
10 . - Ethernet access (OSI 1-4)
10 20 - Ring access
Applications:
15 . - ADC ingress and time stamp
20 10 - Subband filterbank (critically sampled)
0 30 - Subband filterbank (oversampled)
10 . - Beamformer
20 . - Subband correlator
25 . - Transient buffer (DDR4 interface, subband select and DM >= 0, packet format, M&C, RW access via M&C)
20 . - Transient detection
20 . - Subband offload
0 . - 160 MHz
35 . Integration
5 - FPGA pinning
10 - Interface test designs unb2c
5 - Design revisions and lab tests
15 - Technical commissioning
1 week = 100% project allocation, bruto 40 hours, netto 40 * 0.8 = 32 hours = 4 days
sprint = 100% project allocation, bruto 3 weeks, netto 12 days
v1 : 10 + 20 + 10 + 10 + 15 + 20 + 10 + 20 + 25 + 20 + 20 + 35 = 215 bruto weeks --> 215 / 40 = 5.4 FTE ~ 3 people each 2 years
v2 : 10 less for critically sampled PFB
10 more for updating existing components
10 more for ring access
30 for oversampled PFB
. consider unb2c test part of SDP FW integration and of SDP HW
15 technical commissioning relies on proper Systems Engineering, otherwise may become 50 weeks
==> EK, JH: v1 estimate of April 2019 is still valid as v2 on 10 Oct 2019.
v3 :
Infrastructure
20 - Development environment using GIT, RadioHDL, updating existing components
5 - unb2c FPGA pinning
10 - unb2c FPGA interface test designs
20 - Board Support Package using Gemini Protocol and ARGS
20 - Ring access
10 - 10GbE access (OSI 1-4)
Applications:
15 - ADC input and time stamp
10 - Subband filterbank (critically sampled)
20 - Subband correlator
10 - Beamformer
25 - Transient buffer
20 - Subband offload for AARTFAAC
20 - Transient detection
30 - Oversampled subband filterbank
0 - Support 160 MHz
Integration:
10 - Lab tests
5 - Technical commissioning Dwingeloo
5 - Technical commissioning Prototype Station
All:
20 + 5 + 10 + 20 + 20 + 10 + 15 + 10 + 20 + 10 + 25 + 20 + 20 + 30 + 0 + 10 + 5 + 5 = 255
No oversampled filterbank:
20 + 5 + 10 + 20 + 20 + 10 + 15 + 10 + 20 + 10 + 25 + 20 + 20 + 0 + 10 + 5 + 5 = 225
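The v3 totals can be reproduced from the list of work packages. The sketch below copies the week estimates from the v3 list above; the dictionary name v3_weeks is made up for this illustration.

```python
# Estimated bruto weeks per v3 work package, copied from the list above.
v3_weeks = {
    "Development environment": 20,
    "unb2c FPGA pinning": 5,
    "unb2c FPGA interface test designs": 10,
    "Board Support Package": 20,
    "Ring access": 20,
    "10GbE access": 10,
    "ADC input and time stamp": 15,
    "Subband filterbank (critically sampled)": 10,
    "Subband correlator": 20,
    "Beamformer": 10,
    "Transient buffer": 25,
    "Subband offload for AARTFAAC": 20,
    "Transient detection": 20,
    "Oversampled subband filterbank": 30,
    "Support 160 MHz": 0,
    "Lab tests": 10,
    "Technical commissioning Dwingeloo": 5,
    "Technical commissioning Prototype Station": 5,
}

total_all = sum(v3_weeks.values())
total_no_os = total_all - v3_weeks["Oversampled subband filterbank"]

print("All:", total_all)
print("No oversampled filterbank:", total_no_os)
```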
ICD interface types:
m - Mechanical (structural, loading, tooling, etc)
f - Fluid (pneumatic, cooling, heating, condensate, fuels, lubricants, waste, exhaust, feedstocks etc)
t - Thermal (cooling, heating, heatsinking, etc)
em - Electromagnetic (DC field, RF, etc)
o - Optical (numerical aperture, focal position, etc)
p - Electrical (i.e. conducted power)
e - Electronic (i.e. conducted signals or data)
eo - Electro-optical (generally signals or data)
d - Data exchange specifications (protocol stack)
h - Human-Machine Interface (special combination of some of the above)
UDP link control
- flow control = end-to-end
- congestion control = peer-to-peer within the network
. reliable transmission, at fair rate, with high resource utilization
. implemented in network layer
. also called transport protocol --> TCP ++, UDP -- (selfish protocol, low delay)
- ARP
. Tx ARP request
- UDP/IPv4
. UDP checksum (not used in LOFAR1)
> nslookup <hostname> # e.g. <astron.nl> to find IP address
> sudo arp
> ping <IP address> # to find MAC address for IP address ?
LFAA-CSP_Low : OSI (Open Systems Interconnection) layers
7 Application : Not applicable, this is the level where the STAT and CEP products each perform their
allocated functions.
6 Presentation :
- SPEAD header
header first word:
magic = 0x53 ='S' 8b, version = 0x4 8b, itemPointerWidth = 0x2 8b, HeapAddrWidth = 0x0 8b, rsvd=0 16b,
number of items = 0x8 16b
header items:
heap_counter = coarse channel number (1-511) 16b, packet counter 32b # restart at 0 for new
observation, 2k samples per packet --> packet counter wraps after few days
pkt_len = packet payload length 48b
sync_time = unix_epoch_time [s] 48b # last time system was synchronised by PPS in seconds since 1 Jan 1970
timestamp = timestamp [ns] 48b # time of center of first sample in packet since sync_time in
ADC sample periods of 1.25 ns
center_freq = frequency [Hz] 48b # center frequency of coarse channel (1-511) * 781250 in Hz
csp_channel_info = rsvd 16b, beam_id 16b, freq_id 16b
csp_antenna_info = substation_id (1-512) 8b, subarray_id (1-16) 8b, station_id 16b, nof_contributing_antenna
(typ. 256) 16b
sample_offset = payload_offset = 0x0
data
- 1 beam, 1 coarse channel
- sampling period is 1.25 ns * 1024 * 27/32 = 1080 ns
- 8 bit complex coarse channel samples
- Xre, Xim, Yre, Yim = 32b
- samples are in strict time order
- 2's complement
- most negative value -128 indicates error
5 Session : Controls connections (start, manage, terminate)
- SPEAD header
4 Transport : Flow control, error recovery, retransmission
- UDP [RFC 768]
- The peak data rate on a link shall be no more than 20% (TBC) above the average data rate
3 Network : addressing, routing
- IPv4 Internet Protocol
2 Data link : link between two nodes
- Ethernet standard [IEEE Std 802.3-2015], 40 GbE
1 Physical :
- Ethernet standard [IEEE Std 802.3-2015], 40 GbE
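The coarse channel timing in the SPEAD items above follows from the 1.25 ns ADC sample period. The sketch below recomputes the sample period and the center frequency grid with exact fractions; the constant and function names are assumptions made for this illustration.

```python
from fractions import Fraction

T_ADC_NS = Fraction(5, 4)        # ADC sample period of 1.25 ns
N_FFT = 1024                     # coarse channel filterbank size
R_OS = Fraction(32, 27)          # oversampling factor
CHANNEL_SPACING_HZ = 781250      # coarse channel frequency grid


def coarse_sample_period_ns():
    """Sample period of an oversampled coarse channel in ns."""
    return T_ADC_NS * N_FFT / R_OS


def center_freq_hz(channel):
    """Center frequency of coarse channel 1..511 in Hz."""
    if not 1 <= channel <= 511:
        raise ValueError("coarse channel number must be in 1..511")
    return channel * CHANNEL_SPACING_HZ


print(coarse_sample_period_ns())          # in ns
print(center_freq_hz(1), center_freq_hz(511))
```

The channel spacing of 781250 Hz is the 800 MHz ADC sample rate divided by N_fft = 1024, so it is not changed by the oversampling.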
L1 ICD 11109 : STAT - CEP
. Beamlet data
. Transient buffer read out
Not included:
. SST, BST, XST, because these are for monitoring and calibration, not for science data
. Subband offload for AARTFAAC2.0 will have own EICD
STAT-CEP Beamlet data interface:
- VERSION_ID 8b
. 2,3,4 for LOFAR1
. 5 first for LOFAR2.0
- SOURCE_INFO 16b
. 2b Array ID (core station 1 LBA, 2 HBA, ...)
. 1b f_adc = 200 MHz, 160 MHz
. 1b critically sampled PFB, oversampled PFB (or p, q for R_os = p/q)
. 4b beamlet width in number of bits (default 8 for W_beamlet = 8 bit, instead of BM = beamlet mode)
. 5b UniBoard2 FPGA id (16 FPGAs for LBA, 16 for HBA in International Station, instead of RSP ID)
. ==> Also beamlet scale setting
. ==> Number of antenna in beam (core, LBA, HBA inner to make HBA international look like HBA remote)
- CONFIGURATION_ID 8b (used in LOFAR1? intended to refer to the parset that defines this observation)
==> observation ID 32b
- STATION_ID 16b (idem as LOFAR1)
==> or 8b because there are only ~50 stations
- One packet per range of Station beamlets out of 488 beamlets
. Full band : S_sub_bf * W_beamlet * N_complex / W_byte = 488 * 8b * 2 / 8b = 976 octets
. NOF_BEAMLETS_PER_BANK not needed anymore
. nof_streams = Number of beamlet streams
- Separate destination address per stream
- LOFAR1 supports 4 streams
- LOFAR2.0 preferably supports >> 4 streams
- beamlet_id to identify start beamlet in stream (provides more info than a stream ID)
- NOF_BEAMLETS_PER_BLOCK to identify range of beamlets from beamlet_id
- LOFAR1: beamlet_id = 0 and NOF_BEAMLETS_PER_BLOCK = 61 (dual pol beamlets, 4 streams):
- NOF_BLOCKS 16b in payload
. Multiple beamlet time slots in one packet to increase payload efficiency.
. For W_beamlet = 8 bit there can be maximum 9 blocks per payload (9 * 976 = 8784 octets < 9000)
. With nof_streams >> 4 the NOF_BLOCKS can become larger, therefore use 16b. For example:
- NOF_BEAMLETS_PER_BLOCK = ceil(S_sub_bf / nof_streams) = ceil(488 / 32) = 16
- NOF_BEAMLETS_PER_BLOCK * W_beamlet * N_complex / W_byte = 16 * 8b * 2 / 8b = 32 octets
- 9000 / 32 = 281 > 256 --> use 16b for NOF_BLOCKS
- nof_streams = 22 destination nodes, each with 8k Byte payload, possibly a double buffer:
22 * 8 kByte * 2 = 352 kByte = 176 BRAM (1 BRAM = 2 kByte, FPGA has 2713 BRAM)
- 488 / 22 = 22.18, so 488 = 4 * 23 + 18 * 22
. Only send correct data to CEP (so no need for SOURCE_INFO/payload error bit).
. How to handle blocks that got lost within the Station?
- TIMESTAMP 64b (instead of 32b seconds TIMESTAMP and 32b BLOCK_SEQUENCE_NUMBER within second)
. A 64 bit timestamp in 0.2 ns resolution since t_base = 1970 for first block in payload:
- to fit both T_adc = 5 ns (200 MHz) and 6.25 ns (160 MHz); note that 6.25 ns is not an integer multiple of 0.2 ns (TBC)
- for 116 year span since t_base = 1970 --> 2086
- BLOCK_PERIOD 16b
. 16 bit block period in 0.2 ns resolution
. 2**16 * 0.2 ns = 13.1 us block period (block rate > 76 kHz) fits T_sub
- BSN 64b
. Block sequence number since t_base = 1970 of first block in payload, increments by 1 for every block
. Used to detect lost blocks and to align blocks from different stations
- TX_PACKET_COUNT 32b
==> Not useful, because then CEP needs to count Rx packets. Better send filler packets to keep the
packet rate at the nominal rate, so that any packet loss is due to the Network and already
clear at OSI layer 2 using lower level tools like Wireshark.
. OSI transport layer 4
. Per stream
. Started at Station power up, increments by 1 for every transmitted packet.
. To allow CEP to recognize packets that got lost on the Network, from data blocks that got lost
in the Station ring or packets that were not sent because the output was disabled.
. Only transmit packets that have contiguous blocks / allow a varying number of blocks per packet
in case a block is lost on the ring.
- Data
. X, Y paired dual polarization beamlets
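The payload and header field sizes above can be checked with a short calculation. This is only a sketch of the arithmetic in these notes: the function names are hypothetical, the 9000 octet payload limit and the 0.2 ns resolution are taken from the text, and a year is assumed to be 365.25 days.

```python
W_BYTE = 8           # bits per octet
N_COMPLEX = 2        # real and imaginary part per beamlet
MAX_PAYLOAD = 9000   # octets, payload limit used in the notes
T_RES = 0.2e-9       # timestamp resolution in seconds (0.2 ns)


def block_octets(nof_beamlets_per_block, w_beamlet=8):
    """Octets needed for one block (one time slot) of beamlets."""
    return nof_beamlets_per_block * w_beamlet * N_COMPLEX // W_BYTE


def max_blocks_per_payload(nof_beamlets_per_block, w_beamlet=8):
    """Largest number of whole blocks that fits in one payload."""
    return MAX_PAYLOAD // block_octets(nof_beamlets_per_block, w_beamlet)


def split_beamlets(nof_beamlets, nof_streams):
    """Distribute beamlets over streams as evenly as possible."""
    base, extra = divmod(nof_beamlets, nof_streams)
    return [base + 1] * extra + [base] * (nof_streams - extra)


def timestamp_span_years(nof_bits=64):
    """Time span covered by an nof_bits wide counter of 0.2 ns ticks."""
    return 2**nof_bits * T_RES / (365.25 * 24 * 3600)


def max_block_period_us(nof_bits=16):
    """Longest block period that fits in BLOCK_PERIOD, in microseconds."""
    return 2**nof_bits * T_RES * 1e6


print(max_blocks_per_payload(488))   # 488 beamlets per block
print(max_blocks_per_payload(16))    # 16 beamlets per block, exceeds 256
print(split_beamlets(488, 22))       # 4 streams of 23 and 18 streams of 22
print(timestamp_span_years(), max_block_period_us())
```

The second result is larger than 256, which is the reason given above for making NOF_BLOCKS 16 bit wide.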
*******************************************************************************
* Station Control software:
*******************************************************************************
The Station contains hardware and software devices that deliver the functionality of the application
[4.1.2.1]. The Station Control software consists of Control and M&C. The Control determines the behaviour
of the devices in time. Via the M&C the Control can control the devices and monitor them. The M&C uses
a standard software interface for the Control to access the devices. For the Station M&C the M&C will
use OPC-UA as standard M&C access interface for all devices [4.1.2.2]. Only in certain cases can there
be an exception to not use OPC-UA [4.1.2.3.2].
The M&C system is an abstraction layer between the high level software of the Control and the low level
software or firmware in the devices [4.1.2.3]. The M&C will use the master-slave pattern to monitor a
device, so the device will only provide monitoring information on request and never by itself. In this
way the control and monitoring traffic are independent. If the device performs a certain task, then it
may provide a monitoring point that allows the master to monitor the progress. Only for low latency
events that originate in the device it may be necessary to use the publish-subscribe pattern, whereby
the slave self-generates an event message.
*******************************************************************************
* M&C of SDP firmware
*******************************************************************************
For the M&C of the SDP firmware that runs on the array of FPGAs on the UniBoard2s there will be an
SDP converter/bridge that translates between the FPGA memory map and OPC-UA [4.1.2.3.1]. Using ARGS
it may be possible to generate the device specific parts of the bridge software, because the number
of FPGAs and all register fields in the FPGA memory map are known [4.1.2.5.1].
*******************************************************************************
* Monitoring interval
*******************************************************************************
In LOFAR1 the M&C that is supported by the FPGA firmware has two flavors:
- Asynchronous (immediate)
. C: the data point values are applied upon arrival of the request message.
. M: the data point values are reported upon arrival of the request message.
- Synchronous (fixed at the PPS grid):
. C: the data point values are applied in the next PPS period
. M: the data point values of the previous period are reported in this PPS period.
The asynchronous M&C is suitable if the data point value is static or if its precise timing does not
have to be more accurate than what the M&C can achieve (order of 10 ms). The synchronous M&C is
suitable for data point values that need sample period accurate timing within one FPGA or between
FPGAs in parallel. The synchronous M&C can be for a single PPS instant or for every PPS instant.
- Use fixed internal sync aligned to PPS
. In LOFAR1 and APERTIF the sync period is used as fixed update interval for periodic monitoring,
periodic control (the beamformer weights) and periodic integration intervals (AST, SST, BST and
XST).
. The advantage of a fixed update interval is that it is well defined and does not need control.
This can also be a disadvantage because a fixed interval is inflexible and cannot be controlled
by the SCU. Probably only for the XST this flexibility is nice to have.
. With a fixed interval the monitored information may only reflect what happened during the previous
period. Therefore if the monitoring has to be without gaps in time then the SCU needs to monitor
and aggregate the information at every period. Using a configurable period this aggregation in the
SCU can be avoided.
. The SCU must read the statistics within the second between two PPS (with some 10 ms margin). This is
feasible but a strict grid.
. If the SCU reads at arbitrary time, then part of the read values may apply to this second and some
to the previous second. For most monitoring this is no problem. If necessary the SCU can wait for
PPS and then read the monitoring to ensure that it relates to the same interval on all FPGAs.
- Use single event BSN timestamp scheduler
. Gemini M&C protocol does not have timestamp activated control yet, therefore use separate BSN scheduler
control point.
. SCU can read the statistics after the scheduled BSN
. The next integration lasts until the next scheduled BSN
. The programmable interval allows arbitrary integration intervals, which avoids the need for the
SCU to integrate 1 s intervals in case longer intervals are needed.
. The SCU can then scale the statistics result based on the actual integration period of each
measured interval, while the intervals are still all without gaps.
. Dependent on the speed of the SCU it can use shorter integration intervals, by scheduling the next
BSN as soon as it has finished reading the statistics from the previous interval
. The BSN scheduler should also provide a monitoring value for the integration interval, i.e. the
number of block periods since the previous scheduled BSN.
. If the schedule interval is too long then the statistics and monitoring counts may overflow.
The values should then clip and not wrap, to show that they overflowed.
- Use periodic event timestamp scheduler.
. Control: The period interval is defined by a start time and a period time. If the period time is -1
then the period scheduler acts as a single event scheduler.
. Monitor: The periodic scheduler can report the current time when read, the time of the last event, the
time of the next event (or -1 for no scheduled event) and the deltas cur - prev and next - cur.
. A periodic event only needs to be setup once by the SCU. The setup can be changed at any time.
. The BSN cannot be used directly, because the PPS grid does not always fit the BSN grid. Therefore use
the 64 bit timestamp with 0.2 ns resolution to schedule the start time and the period. The event will
occur at the BSN slot that is at or directly after the event time.
. By default after power up the start time of the timestamp scheduler starts at the PPS using the initial
BSN. The default period is 1 s, so 5000000000 [0.2 ns]. In this way the periodic scheduler behaves
similar to the PPS driven sync interval in LOFAR1.
. Using the 64 bit timestamp with 0.2 ns is more clear than using a BSN scheduler with fractional BSN
period control
. For short integration intervals the SCU may not be able to keep up. It is more robust to allow a
short but not necessarily constant integration interval, which is known via the monitoring point.
Instead of the periodic scheduler the SCU then schedules a new event after it has finished reading
the monitoring data from the previous event.
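The periodic event timestamp scheduler described above can be sketched as follows. This is a minimal model, not the firmware behaviour: the function names are hypothetical, all times are integers in units of 0.2 ns since t_base, and the block period of 25600 units (1024 samples at T_adc = 5 ns) is an assumption used only for the example.

```python
T_UNITS_PER_S = 5_000_000_000   # number of 0.2 ns units in 1 s
NO_EVENT = -1                   # reported when no event is scheduled


def next_event_time(start_ts, period_ts, now_ts):
    """Return the first event time at or after now_ts.

    A period of -1 makes the scheduler act as a single event scheduler,
    so after the start time has passed there is no next event.
    """
    if now_ts <= start_ts:
        return start_ts
    if period_ts == -1:
        return NO_EVENT
    # Ceiling division gives the first period boundary at or after now_ts
    k = -(-(now_ts - start_ts) // period_ts)
    return start_ts + k * period_ts


def event_bsn(event_ts, block_period_ts):
    """Return the BSN slot that is at or directly after the event time."""
    return -(-event_ts // block_period_ts)


# Assumed block period: 1024 samples * 5 ns = 5120 ns = 25600 units
BLOCK_PERIOD = 25600

t1 = next_event_time(0, T_UNITS_PER_S, 1)   # first 1 s event after start
print(t1, event_bsn(t1, BLOCK_PERIOD))      # 1 s is 195312.5 blocks
t2 = next_event_time(0, T_UNITS_PER_S, t1 + 1)
print(t2, event_bsn(t2, BLOCK_PERIOD))      # 2 s is a whole number of blocks
```

With these assumed numbers the 1 s event falls between two BSN slots and is moved to the next slot, while the 2 s event falls exactly on a slot. This illustrates why the start time and the period are expressed in the 0.2 ns timestamp instead of in BSN units.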
Behaviour of the data points:
- Asynchronous:
. Only clear data points on control write access, so not as side effect of a monitor read access
- Synchronous:
. Dual page data points swap or shift page at a synchronous event, to provide a precisely timed
and stable data value that can be written for control before the event or read for monitor after
the event.
- Apertif MM registers
. Async :
- ETH control and status
- WDI
- UNB_SENS
- COMMON_PULSE_DELAY
- ADC_QUAD
- FIL_COEFS
- SS_REORDER
- DIAGNOSTICS_BACK counts clear after dedicated write access
- TR_NONBONDED_BACK
- DP_RAM_FROM_MM
- BF_WEIGHTS
- DP_PKT_MERGE
- DP_SPLIT
- DP_SWITCH
- DP_SYNC_CHECKER side effect counts clear when read
- DP_BSN_ALIGN_INPUT
- DP_FIFO_FILL
- DP_XONOFF_OUTPUT
- DP_OFFLOAD_RX_HDR_DAT
- DP_OFFLOAD_TX_HDR_DAT
- DPMM_CTRL
- DPMM_DATA
- MMDP_CTRL
- MMDP_DATA
- IO_DDR
- DP_XONOFF_OUTPUT
- DP_OFFLOAD_TX
- TR_XAUI
- MDIO_0
- TR_10GBE
- EPCS
- REMU
. Async, restart immediate after last write
Sync, restart by external sync from PPS, BSN scheduler, PPS after write, or
- I2C master
- DIAG_WG
- DIAG_BG
- DP_SHIFTRAM
- BSN_SOURCE
. Sync, generate single event at BSN
- BSN_SCHEDULER_WG
. Sync, single page, periodic event latch value at every sosi.sync
- ADUH_MON (mean, sum)
- BSN_MONITOR
. Sync, single page, periodic event store values at every sosi.sync, or
Async store data after last read
- ADUH_MON (buffer)
- DIAG_DATA_BUFFER
. Sync, dual page monitor, periodic event latch sum values and restart integration at every sosi.sync
- ST_SST
. Sync, dual page control, periodic event page swap at sync when last value was written (so only then swap)
- DP_FRINGE_STOP_OFFSET
Conclusion:
- Identify the cause of an error preferably via a single monitoring point
- With proper monitoring no test time is needed
- Support writing status fields in a test mode for SW - FW interface testing
- Use 1 s sync interval of PPS to time periodic M&C events for all. Optionally support a local BSN scheduler
for the XST.
Semi-floating point values for SST, XST and BST
int2float.vhd uses a 32bit semi-float with 1 bit exponent and 31 bit mantissa
- float_w = 32
- int_w = 54
- exp_w = 1
- mantissa_w = float_w - exp_w = 31
- mantissa = 2**31
- base_w = int_w - mantissa_w = 23
- base = 2**base_w = 2**23
- if int in range -mantissa/2 to +mantissa/2-1 then exp = 0 and float = int
else exp = 1 and float = round(int / base)
- if exp = 0 --> int = mantissa
- if exp = 1 --> int = mantissa * base
53 52 51 50 49 48 47 46 45 ... 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 ... 4 3 2 1 0
<s--s--s--s--s--s--s--s--s-...--s--s><s---------mantissa-----------------------...-------------->
53 52 51 50 49 48 47 46 45 ... 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 ... 4 3 2 1 0
<s------mantissa-----------...------------------------------><------2**23------...-------------->
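The conversion rules listed above can be modelled in a few lines. This is a sketch of the listed rules and not a transcription of int2float.vhd: the function names are hypothetical, the result is returned as an (exp, mantissa) pair because the position of the exponent bit in the 32 bit word is not stated here, rounding is assumed to be half up (towards + infinity), and clipping of the largest values is an assumption.

```python
FLOAT_W = 32
INT_W = 54
EXP_W = 1
MANTISSA_W = FLOAT_W - EXP_W       # 31
BASE_W = INT_W - MANTISSA_W        # 23

M_MIN = -(1 << (MANTISSA_W - 1))   # -2**30, same as -mantissa/2
M_MAX = (1 << (MANTISSA_W - 1)) - 1


def int2float(value):
    """Convert a signed INT_W bit integer to (exp, mantissa)."""
    if M_MIN <= value <= M_MAX:
        return 0, value                        # exp = 0: value kept exactly
    # exp = 1: divide by base = 2**BASE_W with round half up
    m = (value + (1 << (BASE_W - 1))) >> BASE_W
    # Assumption: clip if rounding the largest values overflows the mantissa
    return 1, min(max(m, M_MIN), M_MAX)


def float2int(exp, mantissa):
    """Convert (exp, mantissa) back to an integer."""
    return mantissa << BASE_W if exp else mantissa


print(int2float(12345))            # small value is kept exactly
print(int2float(1 << 30))          # first value that needs exp = 1
print(float2int(*int2float(3 * 2**40 + 12345)))
```

For exp = 1 the restored integer differs from the original by at most base / 2 = 2**22, except for values that were clipped.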
The SNR decrease due to quantization is:
(s/s0)^2 = 1 + 1/(12*g^2)
where s = sigma, s0 = sigma of sky noise, d = quantization step, g = s0 / d.
After integration over N powers the SNR improves by a factor sqrt(N), so the integrated power S of s needs
to be represented by log2(sqrt(N)) extra bits to contain the processing gain of the incoherent integration
plus twice as many bits as s to contain the power s^2. The sqrt(N) processing gain for integrated powers is
also derived in MEM-131 of ARTS by SJW:
(s/s0)^2 = 1/N + 1/(12*g^2)
For N = 1 there is no integration, so then the SNR decrease is defined by the input quantization. For N > 1
the integration improves the SNR by a factor N, so the quantization needs to become finer as well.
The power value of s^2 needs twice as many bits, to contain the whole range, but only half of these are
significant, because squaring a number does not add information. Hence for sigma s with 4 bits including
sign bit and N = 195312.5 so log2(sqrt(195312.5)) = 8.8 bits this results in at least 4 + 8.8 = 12.8 bit.
The power values use twice as many bits, so 2 * 12.8 - 1 sign = 24.6 bits or about 25 bits to represent
the SST and XST power values. For the BST the BF also provides a processing gain for incoherent noise
of log2(sqrt(96)) = 3.3 bits, so the BST needs about 12.8 + 3.3 = 16.1 bits. The power values use twice as
many bits, so 2 * 16.1 - 1 sign = 31.2 bits or about 31 bits to represent the BST power values.
The semi-floating point value of 32 bit with 1 bit exponent can represent 31 bit values without extra
rounding, which is suitable to represent the SST, XST and BST values for measurements without RFI or
weak RFI. The exponent = 1 representation is suitable and needed to represent the strong RFI signals.
The range of the semi-floating point values for the powers can be increased by first rounding the powers
by up to 4 + 8.8 = 12.8 LSbits, because these power bits are not significant. A safe value would be to
round e.g. 8 LSbits of the power int values before converting them to semi-floating point. The calculated
power values then become 10*log10(2**8) = 24 dB lower.
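The bit width estimates above can be reproduced with the following sketch. The function names are hypothetical, and the numbers (4 bit sigma including sign, N = 195312.5, a beamformer sum over 96 inputs, 8 rounded LSbits) are the ones used in the text.

```python
import math


def power_bits(w_sigma, n_int, n_bf=1):
    """Estimate the number of bits needed for an integrated power value.

    w_sigma : bits for sigma, including the sign bit
    n_int   : number of integrated powers
    n_bf    : number of inputs summed by the beamformer (1 for SST and XST)
    """
    gain_bits = math.log2(math.sqrt(n_int)) + math.log2(math.sqrt(n_bf))
    # The power needs twice as many bits and has no sign bit
    return 2 * (w_sigma + gain_bits) - 1


def power_drop_db(nof_lsbits):
    """Decrease of the power value in dB when nof_lsbits LSbits are rounded."""
    return 10 * math.log10(2**nof_lsbits)


print(round(power_bits(4, 195312.5), 1))       # SST and XST
print(round(power_bits(4, 195312.5, 96), 1))   # BST
print(round(power_drop_db(8), 1))              # rounding 8 LSbits
```

Both estimates stay within the 31 bit mantissa range only marginally for the BST, which matches the remark that the exponent = 1 representation is needed for strong RFI.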
Note:
- The ADC sign bit is also a bit that counts in the SNR as 6 dB. The sigma value is positive by definition.
For (s/s0)^2 = 1.01, so 1 % worse SNR due to quantization, g = 3.53 and s0 = 3.53 d. Hence s0 is log2(3.53)
= 1.8 bit including sign bit (1.8b = 10.8 dB). For Gaussian noise the -3 to +3 sigma range contains 99 %
of the values. The 3 sigma corresponds to log2(3) = 1.6 bit. In total the ADC 3 sigma input then fits in
3.4 bit, so use 4 bit as a practical lower limit for ADC input quantization with negligible quantization
loss.
- The rounding of a large value A can cause that +A and -A become +B and -B+1, when rounding is done
to + infinity (which is what round() does). In the LOFAR 1.0 subband correlator this caused confusion,
because both the X*Y and Y*X were calculated, which should agree to X*Y = conj(Y*X), but due to the rounding
effect could differ by 1. For LOFAR 2.0 this can again occur for the cross correlation of the local
signal inputs per PN. The cross correlations with the remote signal inputs are calculated only once, so
there the rounding effect is not noticed. To avoid this difference common_int2float.vhd in LOFAR 1.0
can use g_symmetric=TRUE, which applies truncating to zero.
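The asymmetry can be shown with a small example. This is only an illustration with hypothetical function names: round half up stands for rounding to + infinity, and truncation to zero stands for the symmetric option as it is described in the note above, not for the actual VHDL.

```python
import math


def round_half_up(value, base):
    """Divide by base and round to nearest, with ties towards + infinity."""
    return math.floor(value / base + 0.5)


def truncate_to_zero(value, base):
    """Divide by base and drop the fraction, which is symmetric around 0."""
    return math.trunc(value / base)


A = 5
BASE = 2

# Ties towards + infinity: +A and -A map to +B and -B+1
print(round_half_up(+A, BASE), round_half_up(-A, BASE))

# Truncation to zero: +A and -A map to +B and -B
print(truncate_to_zero(+A, BASE), truncate_to_zero(-A, BASE))
```

The difference only shows up when the dropped part is exactly half of the base, which is why the X*Y and conj(Y*X) results differ by at most 1.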