erko_firmware_dag.txt

Author: Eric Kooistra, jan 2018
Title: Status of FPGA firmware development at DESP

Purpose:
- Explain how we currently develop FPGA firmware at DESP

1) Develop FPGA hardware boards
  - Review board design document and schematic, so that the board will not contain major bugs and
    so that firmware engineers can already learn about the board and get familiar with it
  - Pinning design to verify schematic
  - Vendor reference designs to verify the IO
  - Heater design to verify the cooling and the power supplies

  - Board architecture:
    . RSP ring with 4 AP (with ADC) and 1 BP
    . UniBoard1 mesh with 4 BN (with ADC) and 4 FN, 4 transceivers per 10GbE, DDR3
    . UniBoard2 4 PN, 1 transceiver per 10GbE, DDR4
    . Gemini 1 FPGA, 25 Gb  transceivers, DDR4, HBM
    
2) Technology independent FPGA:
  - Wrap IP (IO, DSP, memory, PLL) --> needed for board_heater design, board_minimal, board_test
  - Use to support:
    * different vendors:
      . Xilinx (LOFAR, SKA CSP Low)
      . Altera (Aartfaac, Apertif, Arts)
    * FPGA type and sample versions
    * synthesis tool versions
    
3) Board firmware
  - board_minimal design that provides control access to the FPGA board and the board control functions.
    It uses the monitoring and control protocol via the MM bus (UniBoard using Nios and UCP, Gemini using
    hard coded Gemini protocol)
  - board_test design that contains the minimal design plus interfaces to use the board IO (gigabit
    transceivers, DDR)
  - board library back, mesh models
  - pinning files 
   

4) OneClick
  - OneClick is our umbrella name for new ideas and design methods ('ideeen vijver', a pond of ideas), with focus on firmware
    specification and design. New tools that are created within OneClick may end up as part of the RadioHDL environment. This
    has happened for example with ARGS. OneClick is about 'what' we could do, RadioHDL is about 'how' we do it.
  - The name OneClick relates to our 'goal at the horizon' to get in one click from design to realisation.
  - Automate design flow
  - Array notation (can be used in document and in code --> aims for simulatable specification)
  - Modelling in python of data move, DSP and control
  - We now work with data move libraries, but could we not better program these ad hoc in one process?

 
5) RadioHDL
  - Board toolset (unb1, unb2a, rsp, gmi, etc) to manage combinations of board version, FPGA version, tool version
  - RadioHDL is our umbrella name for the set of tool scripts that we use for firmware development, with focus on implementation.
  - The name RadioHDL covers HDL code for RadioAstronomy as a link to what we do at Astron. However by using only the word Radio we keep
    the name a bit more general, because in fact the RadioHDL tool scripts can be used for any (FPGA) HDL development.
  - Automate implementation flow (source --> config file --> tool script --> product, a product can be the source of a next product)
  - Organize code in libraries using hdllib.cfg
  - Manage tool versions using hdltool_<toolset name>.cfg
  - Create project files for sim and synth
  - ARGS (Automatic Register Generation System using MM bus and MM register config files in yaml)
    . add more configuration levels:
       peripheral configuration yaml
       fpga configuration yaml
       board yaml (board with 1 or more FPGA)
       application (application image on one or more FPGA)
       system (one or more application images that together form the entire FPGA system)
    . add constants configuration yaml --> to define terminology and parameter section in specification document and to use
      these also in firmware and software.
  - Create FPGA info (used to be called system info) address map stored in FPGA to allow dynamic definition of address maps. The definition
    of the MM register fields is kept in files because it typically remains fixed.
  - Easily roll out the environment on a new PC and get a new employee started (to be done)
  

6) VHDL design:
  - Clean coding
  - Reuse through HDL libraries
  - Standard interfaces: MM & ST, support Avalon, AXI using VHDL records mosi/miso, sosi/siso
  - Build FPGA application design upon a board minimal design and the relevant IO from the board test design
  - dp_sosi :
    . data, re/im  : real or complex data
    . valid        : strobe to indicate clock cycles that carry valid data samples, not needed for ADC input
    . sop, eop     : strobes to indicate start of packet and end of packet for blocks of data
    . sync and bsn : timing strobe, block sequence number is timestamp, alignment of parallel streams
    . channel      : valid at sop to multiplex multiple channels in one stream
    . empty        : valid at eop
    . error        : valid at eop
  - dp_siso:
    . ready : backpressure flow control per data valid, only used for components that really need it, to avoid complexity
              and to ease timing closure. The ready can be pipelined with dp_pipeline_ready.vhd. The ready flow control is
              e.g. used to insert a header in front of data blocks to create a packet.
    . xon   : backpressure flow control per block of data. The xon flow control is used to stop the input source to avoid
              overflow of internal FIFOs. Together these FIFOs must at least be capable of storing the current blocks.
              Our applications are data driven, so making xon low will cause data to be dropped. For an application that
              reads data from a disk, like in the all data storage systems, the xon can be used to read the disk as fast
              as the application can process the data, so DSP driven !!!.
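
    The dp_sosi and dp_siso fields listed above can be modelled per clock cycle in Python. This is a minimal
    sketch and not the VHDL record definition: the class names Sosi and Siso, the helper functions and the
    choice to mark sync and bsn only at the sop are assumptions made for illustration. The source() function
    shows the data driven behaviour of xon, where a block that arrives while xon is low is dropped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sosi:
    """Source out / sink in signals for one clock cycle (illustrative model)."""
    data: int = 0
    valid: bool = False
    sop: bool = False
    eop: bool = False
    sync: bool = False
    bsn: int = 0
    channel: int = 0
    empty: int = 0
    error: int = 0


@dataclass
class Siso:
    """Source in / sink out flow control signals for one clock cycle."""
    ready: bool = True
    xon: bool = True


def block_to_sosi(block: List[int], bsn: int, sync: bool = False,
                  channel: int = 0) -> List[Sosi]:
    """Turn one block of data words into per clock cycle Sosi records."""
    last = len(block) - 1
    return [
        Sosi(data=word, valid=True,
             sop=(i == 0), eop=(i == last),
             sync=(sync and i == 0),            # assumption: sync marked at sop
             bsn=bsn if i == 0 else 0,          # assumption: bsn valid at sop
             channel=channel if i == 0 else 0)  # channel is valid at sop
        for i, word in enumerate(block)
    ]


def source(blocks: List[List[int]], xon_per_block: List[bool],
           first_bsn: int = 0) -> List[Sosi]:
    """Data driven source: a block is only sent if xon is high at its start."""
    stream: List[Sosi] = []
    for i, (block, xon) in enumerate(zip(blocks, xon_per_block)):
        if xon:
            stream += block_to_sosi(block, bsn=first_bsn + i, sync=(i == 0))
        # xon low: the whole block is dropped, the BSN still increments
    return stream


stream = source([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [True, False, True], first_bsn=10)
print([(s.data, s.sop, s.eop, s.bsn) for s in stream])
```

    Dropping per block instead of per word keeps every block that does get through complete, which is the
    property that the downstream components rely on.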
    
  - Synthesis tool ensures that the logic per clock cycle is reliable, we have to ensure at functional level that only
    complete blocks of data are being passed on !!!:
    . Incomplete blocks must be dropped at the input
    . FIFOs should never overflow and should not be reset. Avoid overflow by using xon. Clear a FIFO by reading it empty
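
    The rule that incomplete blocks must be dropped at the input can be sketched as a small Python filter.
    This is an illustration under assumptions, not the behaviour of a specific component: the tuple layout
    (valid, sop, eop, data) and the function name pass_complete_blocks are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# One clock cycle of a stream: (valid, sop, eop, data). Hypothetical, simplified layout.
Cycle = Tuple[bool, bool, bool, int]


def pass_complete_blocks(stream: Iterable[Cycle], block_size: int) -> List[List[int]]:
    """Return only the blocks that start with sop, end with eop and hold block_size words."""
    complete: List[List[int]] = []
    current: Optional[List[int]] = None  # None means: not inside a block
    for valid, sop, eop, data in stream:
        if not valid:
            continue                     # gaps in the valid are allowed
        if sop:
            current = []                 # a new sop discards any unfinished block
        if current is None:
            continue                     # valid data without a preceding sop is dropped
        current.append(data)
        if eop:
            if len(current) == block_size:
                complete.append(current)
            current = None               # too short blocks are dropped here
        elif len(current) >= block_size:
            current = None               # no eop at the expected word: drop the block
    return complete


cycles = [
    (True, True, False, 1), (True, False, False, 2), (True, False, True, 3),  # complete
    (True, True, False, 4), (True, False, False, 5),                          # eop missing
    (True, True, False, 6), (False, False, False, 0),
    (True, False, False, 7), (True, False, True, 8),                          # complete, with gap
    (True, False, True, 9),                                                   # stray eop
]
print(pass_complete_blocks(cycles, block_size=3))
```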
   
  - Streaming data versus store and forward !!!
    . dp_bsn_aligner.vhd, aligns input streams using the BSN
      - The resource usage of the dp_bsn_aligner in Apertif Correlator (14 dec 2018) is:
               nof       ALM            FF
               streams   align    MM    align    MM
         input   3         502   213      657   319
         mesh    8        1162   544     1346   784
      - Fill level of dp_bsn_aligner input FIFOs in Apertif Correlator (18 dec 2018) measured with util_dp_fifo_fill.py is:
                  fifo size                  max (min/max)        used (min/max)
                                             1st time  2nd time 
          input   (3+5)*176 (or 180) = 1408  178/762   178/504    139/428
          mesh    (4+3)*88 (or 120)  =  618  159/516   150/262     64/255
          
    . dp_sync_checker.vhd, detects incomplete sync intervals. These are recovered using data from the next sync interval,
        so the next sync interval will get lost. To avoid this would require to store and forward the data of a sync interval
        because then it is possible to fill in missing blocks with dummy data. With store and forward it is also possible
        to recover block order if necessary. The disadvantage of store and forward is latency and memory. Store and forward
        is the general concept for how software operates (on CPU and GPU).
        This scheme of sync interval recovery is only acceptable if dropped packets occur very rarely, because if one
        stream has a dropped packet then the output of the BSN aligner and sync checker will drop a sync interval. E.g.
        Apertif X needs to align N_dish * N_pol = 12*2 = 24 input streams and the sync interval = 1.024 s. These input
        streams come from 10GbE links. A bit error rate of 1e-10 means 1 bit error per s per link. A bit error will cause a
        CRC error. Assume that the packet then gets dropped. The BSN aligner then cannot align that block, so that
        one sync interval gets corrupted and the next will get lost. After that the BSN aligner will
        have recovered. Suppose this should only occur once per 8 hour observation = 28800 s. So with 24 links the BER
        per link should then be less than 1e-10 / 28800 / 24 ~= 1.4e-16, so only about 1 bit error per 8 days per link.
        
     . dp_packet_rx.vhd, ensure that only complete packets enter the FPGA
     . FIFO overflow is a bug, as serious as a FPGA logic error
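
    The bit error rate estimate for the dp_sync_checker scheme above can be reproduced with a few lines of
    Python. This is a sketch that assumes a bit rate of exactly 10 Gbit/s per link and that every bit error
    leads to one dropped packet. The function names are made up for this example.

```python
def max_ber_per_link(link_rate_bps: float, n_links: int, interval_s: float) -> float:
    """Largest BER per link that gives on average one bit error on any link per interval."""
    total_bits = link_rate_bps * n_links * interval_s
    return 1.0 / total_bits


def seconds_between_errors(link_rate_bps: float, ber: float) -> float:
    """Average time between two bit errors on one link."""
    return 1.0 / (link_rate_bps * ber)


link_rate_bps = 10e9       # assumption: 10 Gbit/s per 10GbE link
n_links = 12 * 2           # N_dish * N_pol input streams
observation_s = 8 * 3600   # one 8 hour observation = 28800 s

ber = max_ber_per_link(link_rate_bps, n_links, observation_s)
days = seconds_between_errors(link_rate_bps, ber) / 86400

print(f"required BER per link < {ber:.2e}")
print(f"about one bit error per link every {days:.1f} days")
```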
     
  - Pass on sosi.info fields along a function that only needs data and valid
    . dp_fifo_fill           --> use FIFOs to delay sop info and eop info with variable latency
    . dp_paged_sop_eop_reg   --> use array of register pages to delay sop info and eop info by fixed latency. If
                                 the latency is many sops or if only sync and BSN need to be passed on, then consider
                                 using dp_block_gen_valid_arr
    . dp_block_gen_valid_arr --> recreate sync, local BSN, sop, eop based on valid and pass on global BSN at sync or at
                                 all sop. Useful if the latency is >= 1 sync intervals or many sops.
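
    The idea of recreating the block strobes from only the valid, as listed for dp_block_gen_valid_arr, can be
    sketched in Python by counting valid words. This is a behavioural illustration only. It does not describe
    the ports or generics of the VHDL component, and the function name and tuple layout are assumptions.

```python
from typing import Iterable, List, Tuple

# Per clock cycle output: (valid, sop, eop, sync, local_bsn)
Strobes = Tuple[bool, bool, bool, bool, int]


def regenerate_block_strobes(valid: Iterable[bool], block_size: int,
                             blocks_per_sync: int) -> List[Strobes]:
    """Recreate sop, eop, sync and a local BSN by counting the valid words."""
    out: List[Strobes] = []
    word_cnt = 0    # position within the current block
    block_cnt = 0   # local BSN, counts blocks since the start
    for v in valid:
        if not v:
            out.append((False, False, False, False, block_cnt))
            continue
        sop = word_cnt == 0
        eop = word_cnt == block_size - 1
        sync = sop and block_cnt % blocks_per_sync == 0
        out.append((True, sop, eop, sync, block_cnt))
        if eop:
            word_cnt = 0
            block_cnt += 1
        else:
            word_cnt += 1
    return out


valid_in = [True, True, False, True, True, True, True, True, True]
strobes = regenerate_block_strobes(valid_in, block_size=2, blocks_per_sync=2)
for cycle, s in enumerate(strobes):
    print(cycle, s)
```

    Because only the valid is needed as input, the function in between does not have to transport the other
    sosi fields, at the cost of having to pass on the global BSN separately.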
  
  - Component improvements:
    . Verify flow control in tb of dp_offload_rx and dp_offload_tx_dev (wrapper of dp_concat_field_blk.vhd)
    . reorder_matrix.vhd with timestamp accurate page swap
    . dp_fifo_fill_eop.vhd : fill FIFO with one block instead of some number of words to avoid that FIFO cannot be read empty
    . dp_bsn_aligner.vhd:
      - A dp_bsn_aligner without flow control would make it much simpler.
      - A further simplification is to make a dp_sync_aligner that only can recover alignment at a sync, instead of at
        every sop (via the BSN).
      - Instead of xoff_timeout it is also possible to wait until the FIFO has been read empty for all inputs.
    
     
  - Timing and sync intervals
    . At the ADC input the BSN timestamps are attached to the block data. The block size for the BSN depends on the length of
      the FFT. This BSN relates the data to UTC. MAC initializes the BSN relative to 1 jan 1970.
    . With ADC clock of 800MHz and FFT size of 1024 this yields 800M/1024 = 781250 blocks per sec. We process the data at
      200MHz so we have 4 streams in parallel, each with 781250/4 = 195312.5 blocks per sec. In LOFAR we also had such
      a situation and there we defined odd and even second sync intervals. The even interval then has 195313 blocks and the
      odd interval then has 195312 blocks. This was awkward for control. In Apertif a similar fractional block issue
      occurred in the correlator with 781250 / 64 = 12207.03125 channels per second. Therefore for Apertif we increased the
      sync interval to 1.024 s, such that we have 800000 / 64 = 12500 channels per sync interval. Now we do not have
      even and odd seconds anymore, but this 1.024 s sync interval is also awkward because it does not align with the
      1 s grid that humans use and that other parts of the telescope also use.
      Possible solutions for future systems would be to use a sampling frequency that is a multiple of the FFT size, so
      e.g. 809.6MHz with FFT size = 1024, or 800MHz with FFT size = 800. The latter scheme has the additional advantage that
      then the subband bandwidth is 1 MHz, which fits the typical bandwidth grid in VLBI and it also fits the fact that the
      Apertif LO can be tuned in steps of 10MHz. With subband bandwidth of 781250 Hz only once every 50 MHz the subbands
      align with the 10MHz grid, because 64*781250 = 50M.
    . Using an oversampled filterbank introduces yet another block grid. For example with 32/27 and an FFT block size
      of 1024 the oversampled block size becomes 1024 * 27/32 = 864. This oversampled 864 block grid only aligns with
      the 1024 block grid once every 27 blocks of size 1024. For Apertif the 781250 blocks of 1024 align with the 1
      sec grid, but the 32/27 oversampled blocks will only align every 27 sec. Hence with oversampling it is necessary to
      accept that it becomes impossible to maintain block alignment within a 1 second grid.
    . In APERTIF the misalignment between the channel period and the one second grid was avoided by defining a sync
      interval of 1.024 s and use that sync interval as integration period. A sync interval of 1.024 s for LOFAR would mean
      that a sync interval contains 160000 blocks at f_adc = 160M and 200000 blocks at f_adc = 200 MHz. However if other
      parts of the system rely on a one second or e.g. ten second grid, then using a 1.024 second grid does not fit well
      with those parts. Using an oversampled filterbank introduces yet another block grid. For example with r_os = 32/27
      and an FFT block size N_fft = 1024 the oversampled block size becomes M_blk = 1024 * 27/32 = 864. This oversampled
      M_blk = 864 block grid only integer aligns with the one second grid once every 27 seconds, because 200M / 864 * 27
      = 6250000 and 160M / 864 * 27 = 5000000 yield an integer. The alternative would be to define a sync interval that is
      an integer multiple of M_blk and close to 1 s. Preferably T_int is the same for f_adc = 200M and 160MHz. The ratio
      160M / 200M = 4 / 5, so choose the sync interval to be a multiple of 4 * 5 * 864 = 17280 samples. This then yields
      e.g. ceil(200M / 17280) * 17280 = 200016000 and ceil(160M / 17280) * 17280 = 160012800, which both correspond to
      T_int = 1.00008 s exact. However LOFAR 2.0 needs to be compatible with LOFAR 1.0, so the fact that 1.00008 != 1
      will cause misalignment regarding the statistics like SST, BST, XST from a LOFAR 1.0 station and a LOFAR 2.0
      station. Furthermore to read the statistics and update the BF weights the LCU needs to keep track of the 1.00008 s
      grid. Therefore it is best to keep the one second grid and accept that some sync intervals contain 1 block more than
      the other sync intervals. For the critically sampled filterbank as in LOFAR 1.0 with r_os = 1 this yields
      200M / 1024 = 195312.5 blocks per second on average, so the number of blocks per sync interval then repeats with
      period 2 s as: 195312 + 0,1. For the oversampled filterbank with e.g. r_os = 32/27 and M_blk = 864 this would yield
      200M / 864 = 231481.481 blocks per second on average, so the number of blocks per sync interval then repeats with
      period 27 s as: 231481 + 0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1 because 13 / 27 = 0.481. The
      variation in number of blocks per sync interval is sufficiently small, 1/231481 = 4.3e-6, such that it does not
      significantly affect the accuracy of the statistics values per sync interval.
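
    The block grid arithmetic in this section can be checked with exact fractions in Python. This is a sketch
    with hypothetical function names. It assumes that block 0 starts exactly on a one second boundary and that
    a block is counted in the sync interval in which it starts. With another start phase the order of the long
    and short intervals within the 27 s period differs, but the number of long intervals per period stays the
    same.

```python
from fractions import Fraction
from math import ceil
from typing import List


def blocks_per_second(f_adc_hz: int, block_size: int) -> Fraction:
    """Exact average number of blocks per second."""
    return Fraction(f_adc_hz, block_size)


def alignment_period_s(f_adc_hz: int, block_size: int) -> int:
    """Smallest whole number of seconds that contains a whole number of blocks."""
    return blocks_per_second(f_adc_hz, block_size).denominator


def blocks_per_sync_interval(f_adc_hz: int, block_size: int, n_intervals: int) -> List[int]:
    """Number of blocks that start in each successive one second sync interval.

    Block k starts at k / rate seconds, so interval n holds the blocks with
    ceil(n * rate) <= k < ceil((n + 1) * rate).
    """
    rate = blocks_per_second(f_adc_hz, block_size)
    return [ceil((n + 1) * rate) - ceil(n * rate) for n in range(n_intervals)]


# Critically sampled, N_fft = 1024
print(blocks_per_second(800_000_000, 1024))            # Apertif, integer number of blocks per s
print(blocks_per_sync_interval(200_000_000, 1024, 2))  # LOFAR, alternates with period 2 s

# Oversampled with r_os = 32/27, M_blk = 864
print(alignment_period_s(200_000_000, 864))            # period in s of the block count pattern
counts = blocks_per_sync_interval(200_000_000, 864, 27)
print(counts.count(231482), "intervals of 231482 blocks,",
      counts.count(231481), "intervals of 231481 blocks, total", sum(counts))
```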

        
  - Flow control
    
    . dp_siso : ready and xon  
   
  - Useful libraries and packages:
    . base: common, dp, mm, diag, reorder, uth
    . dsp: wpfb, bf, correlator, st
    . io: eth, io_ddr, i2c

7) Applications:
  - Build upon reused libraries. 
  - New functions are first added as libraries and then used in the application
  - Qsys only used for the MM bus generation


8) VHDL testing:
  - detailed unit tests per HDL library using entity IO
    . verify corner cases
    . often use stimuli --> DUT --> inverse DUT --> expected results
      e.g.
         rx - tx
         encode - decode
         mux - demux
    . sometimes the same component can support both directions:
         dp_repack
         dp_deinterleave
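
    The test pattern stimuli --> DUT --> inverse DUT --> expected results can be illustrated with a Python
    model. The actual unit tests are VHDL test benches. The repack() function below is only a hypothetical
    stand-in for a component that supports both directions. It is not the interface of dp_repack.

```python
import random
from typing import List


def repack(words: List[int], in_width: int, out_width: int) -> List[int]:
    """Repack in_width bit words into out_width bit words, most significant word first."""
    total_bits = len(words) * in_width
    if total_bits % out_width != 0:
        raise ValueError("total number of bits must be a multiple of out_width")
    # Concatenate all input words into one big integer
    packed = 0
    for w in words:
        if not 0 <= w < (1 << in_width):
            raise ValueError("input word does not fit in in_width bits")
        packed = (packed << in_width) | w
    # Slice the big integer into the output words
    n_out = total_bits // out_width
    mask = (1 << out_width) - 1
    return [(packed >> (out_width * (n_out - 1 - i))) & mask for i in range(n_out)]


def test_repack_roundtrip(seed: int = 0, n_blocks: int = 100) -> None:
    """Stimuli --> DUT --> inverse DUT must give back the stimuli."""
    rng = random.Random(seed)
    for _ in range(n_blocks):
        stimuli = [rng.randrange(1 << 8) for _ in range(9)]  # 9 words of 8 bit
        dut_out = repack(stimuli, 8, 24)                     # DUT: 3 words of 24 bit
        result = repack(dut_out, 24, 8)                      # inverse DUT
        assert len(dut_out) == 3
        assert result == stimuli


test_repack_roundtrip()
```

    The advantage of this pattern is that the expected result is the stimuli itself, so no separate reference
    model is needed. A bug that is symmetric in both directions is not detected, so some fixed corner cases
    remain necessary.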
    
  - integration top level or multi FPGA tests using MM file IO
    . MM file IO for testbenches at design level, 'breaking the hierarchy' in VHDL or providing access to Modelsim simulation with Python
    . do not test the details, those must be covered in the unit tests
  - regard the firmware as a data computer, so independent of its functional (astronomical) use we need to verify and validate that for a 
    known stream of input data it outputs the expected output data.
  - Verification via simulation:
    . use of g_sim, g_sim_record to differentiate between simulation and hardware
    . use g_design_name to differentiate between revisions, e.g. to speed up simulation or synthesis
    . behavioral models of external IO (DDR, Transceivers, ADC, I2C)
    . break up data path using WG, BG, DB, data force
    . optional use of transparent DSP models to pass on indices.
    . verify data move by transporting meta data (indices) via the sosi data fields
    . profiler to know time consuming parts
  - VHDL regression test (if not there, then it is not used)
  - Validation on hardware
    . using Python peripherals for MM control using --cmd options per peripheral
    . construct more complicated control scripts using sequence of peripheral scripts and --cmd
    . we need proper data capture machines, to validate 10G, 40 GbE data output (e.g. using wireshark and some Python code)
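
    The verification of data move by transporting meta data (indices) via the sosi data fields, as mentioned
    under verification via simulation, can be sketched in Python. The data value of each word is set to its
    own input index, so the output directly shows where each word came from. The transpose_reorder() function
    is a hypothetical stand-in for a data move function and verify_data_move() is an illustrative helper.

```python
from typing import Callable, List


def transpose_reorder(block: List[int], n_rows: int, n_cols: int) -> List[int]:
    """Stand-in data move function: read a row major block column by column."""
    assert len(block) == n_rows * n_cols
    return [block[r * n_cols + c] for c in range(n_cols) for r in range(n_rows)]


def verify_data_move(dut: Callable[[List[int]], List[int]],
                     expected_source_index: Callable[[int], int],
                     block_size: int) -> bool:
    """Feed indices 0..block_size-1 as data and check where each index ends up."""
    stimuli = list(range(block_size))  # the data value is its own input index
    result = dut(stimuli)
    if len(result) != block_size:
        return False
    return all(result[out_idx] == expected_source_index(out_idx)
               for out_idx in range(block_size))


N_ROWS, N_COLS = 3, 4


def expected_index(out_idx: int) -> int:
    """Specification of the move: output word out_idx comes from this input index."""
    return (out_idx % N_ROWS) * N_COLS + out_idx // N_ROWS


ok = verify_data_move(lambda b: transpose_reorder(b, N_ROWS, N_COLS),
                      expected_index, N_ROWS * N_COLS)
print("data move correct:", ok)
```

    With indices as data the check is independent of the DSP function, so the data move and the DSP can be
    verified separately.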

    
9) Documentation
  - Documentation is needed to specify what we have to make
    . Detailed design document uses array notation to clearly describe all internal and external interfaces
    . Detailed design document also identifies test logic that is needed for the integration top level tests
  - No need to document what we have made, except for readme file and manuals
  - The code is self explanatory (with comment in docstring style using purpose and description)
  - The project scripts identify what is relevant for a product
  - The regression tests identify what is relevant code (if it is not tested it is not important and should not have been made)
  - It would be nice to have YouTube movies that show our workflow and boards


10) Project planning
  - Wild ass guess based on time logs of previous projects
  - System engineering design approach for total product life cycle
  - Agile style with scrum and sprints
    . If it is not an allocated epic/story/task in Redmine then it will not be done
  - Roles within the team
    . System architects remain actively involved during entire project to ensure that design ideas are preserved or 
      correctly adjusted
  - Whiteboard meetings to steer detailed design
    . with wide team to get common understanding and focus
  - Definition of done 
  - What maintenance support do we provide after a project has finished
    . firmware tends to become hardware in time, ' het verstaft'
    . using virtual machines (dockers) to bundle a complete set of operating system, tools and code for the future or to
      export as a starting point to an external party (e.g. for outsourcing)
    
11) Ethernet networks
  - 1GbE, 10GbE, 40GbE, 100GbE IP
  - Knowledge of switches
  - Knowledge of UDP, IP, VLAN
  - Monitoring and Control protocol (UniBoard, Gemini)
  - Streaming data offload

12) Outreach, papers, collaborations, recruiting
  - Oliscience opencores
  - NWO digital special interest group
  - student assignments
  - Write paper on ARGS (done by Mia @ CSIRO)
  - Write paper on RadioHDL (= also intro paper / user guide for RadioHDL on OpenCores)
  - Write paper on RL = 0 coding style with state reg and pipeline reg clearly separated. The design should also work
    without pipeline. Possibly the pipelining should be added automatically and only where needed.
  
13) DESP pillars
  - All data storage
  
  
14) External info
  * Technolution in B&C 2019/2
    - own IP libraries
    - self-checking testbenches made by developer
    - own HDL implementation of the open source RISC-V processor (instruction set architecture), which can run Linux
    - generic build server for simulation and synthesis (with all tool versions)
    - regression test using nightly build
    - version control using GIT (merge request --> review by colleague --> discussion via GIT server)
    - HW regression test using a stimuli generator (e.g. video)
  * High Tech Institute: System Configuration Management
    - start with a model of the company processes
    - first organize then automate
    - baselining means creating timestamped version numbers of components (HW, SW)
  * Dutch system architecting conference 20 june 2019 Den Bosch