From 515e077176961813a699d29d0ce78c21cb95b384 Mon Sep 17 00:00:00 2001
From: Jan David Mol <mol@astron.nl>
Date: Fri, 8 Oct 2021 09:56:29 +0200
Subject: [PATCH] L2SS-434: More XST/SST faqs, added deep clean procedure

---
 docs/source/developer.rst    |  8 ++-
 docs/source/faq.rst          | 99 ++++++++++++++++++++++++++++++++++++
 docs/source/index.rst        |  1 +
 docs/source/installation.rst |  2 +
 4 files changed, 109 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/faq.rst

diff --git a/docs/source/developer.rst b/docs/source/developer.rst
index 169ed0ef5..e14e3350d 100644
--- a/docs/source/developer.rst
+++ b/docs/source/developer.rst
@@ -12,7 +12,11 @@ The docker setup is managed using ``make`` in the ``docker-compose`` directory.
 - ``make build <container>`` to rebuild the image for the container,
 - ``make build-nocache <container>`` to rebuild the image for the container from scratch,
 - ``make restart <container>`` to restart a specific container, for example to effectuate a code change.
-- ``make clean`` to remove all images, containers, and volumes.
+- ``make clean`` to remove all images and containers, and the ``tangodb`` volume. To do a deeper clean, we need to remove all volumes and rebuild all containers from scratch::
+
+  make clean
+  docker volume prune
+  make build-nocache
 
 Since the *Python code is taken from the host when the container starts*, restarting is enough to use the code you have in your local git repo. Rebuilding is unnecessary.
 
@@ -32,6 +36,8 @@ The networks are defined in ``docker-compose/networks.yml``:
 
 The ``$NETWORK_MODE`` defaults to ``tangonet`` in the ``docker-compose/Makefile``.
 
+.. _corba:
+
 CORBA
 ````````````````````
 
diff --git a/docs/source/faq.rst b/docs/source/faq.rst
new file mode 100644
index 000000000..361aac78c
--- /dev/null
+++ b/docs/source/faq.rst
@@ -0,0 +1,99 @@
+FAQ
+===================================
+
+*Q: My device is unreachable, but the device logs say it's running fine.*
+
+The ``$HOSTNAME`` may have been incorrectly guessed by ``docker-compose/Makefile``, or you accidentally set it to an incorrect value. See :ref:`corba`.
+
+*Q: I get "API_CorbaException: TRANSIENT CORBA system exception: TRANSIENT_NoUsableProfile" when trying to connect to a device.*
+
+The ``$HOSTNAME`` may have been incorrectly guessed by ``docker-compose/Makefile``, or you accidentally set it to an incorrect value. See :ref:`corba`.
+
+*Q: The elk container won't start, saying "max virtual memory areas vm.max_map_count [65530] is too low".*
+
+The ELK stack needs the ``vm.max_map_count`` sysctl kernel parameter to be at least 262144 to run. See :ref:`elk-kernel-settings`.
+
+*Q: How do I prevent my containers from starting when I boot my computer?*
+
+You have to explicitly stop a container to prevent it from restarting. Use::
+
+  cd docker-compose
+  make stop <container>
+
+or plain ``make stop`` to stop all of them.
+
+*Q: Some SST/XST packets do arrive, but not all, and/or the matrices remain zero?*
+
+If ``sst.nof_packets_received`` / ``xst.nof_packets_received`` is increasing, packets are arriving, but they are apparently being dropped or contain zeroes. First, check the following settings:
+
+- ``sdp.TR_fpga_mask_RW[x] == True``, to make sure we're actually configuring the FPGAs,
+- ``sdp.FPGA_processing_enabled_R[x] == True``, to verify that the FPGAs are processing, or the values and timestamps will be zero,
+- For XSTs, ``xst.FPGA_xst_processing_enabled_R[x] == True``, to verify that the FPGAs are computing XSTs, or the values will be zero.
+
+Furthermore, the ``sst`` and ``xst`` devices expose several packet counters to indicate where incoming packets were dropped before or during processing:
+
+- ``nof_invalid_packets_R`` increases if packets arrive with an invalid header, or carry the wrong type of statistic for this device,
+- ``nof_packets_dropped_R`` increases if packets could not be processed because the processing queue is full, meaning the CPU cannot keep up with the flow,
+- ``nof_payload_errors_R`` increases if the packet was marked by the FPGA as having an invalid payload, which causes the device to discard the packet.
+
+*Q: I am not receiving any XST and/or SST packets from SDP!*
+
+Are you sure? If ``sst.nof_packets_received`` / ``xst.nof_packets_received`` is actually increasing, the packets are arriving, but are not parsable by the SST/XST device. If so, see the previous question.
+
+Many settings need to be correct for the statistics emitted by the SDP FPGAs to reach our devices correctly. Here is a brief overview:
+
+- ``sdp.TR_fpga_mask_RW[x] == True``, to make sure we're actually configuring the FPGAs,
+- ``sdp.FPGA_communication_error_R[x] == False``, to verify the FPGAs can be reached by SDP,
+- SSTs:
+
+  - ``sst.FPGA_sst_offload_enable_RW[x] == True``, to verify that the FPGAs are actually emitting the SSTs,
+  - ``sst.FPGA_sst_offload_hdr_eth_destination_mac_R[x] == <MAC of your machine's mtu=9000 interface>``, or the FPGAs will not send the packets to your machine. Use e.g. ``ip addr`` on the host to find the MAC address of your interface, and verify that its MTU is 9000,
+  - ``sst.FPGA_sst_offload_hdr_ip_destination_address_R[x] == <IP of your machine's mtu=9000 interface>``, or the packets will be dropped by the network or the kernel of your machine,
+  - ``sst.FPGA_sst_offload_hdr_udp_destination_port_R[x] == 5001``, or the packets will not be sent to a port that the SST device listens on.
+
+- XSTs:
+
+  - ``xst.FPGA_xst_offload_enable_RW[x] == True``, to verify that the FPGAs are actually emitting the XSTs,
+  - ``xst.FPGA_xst_offload_hdr_eth_destination_mac_R[x] == <MAC of your machine's mtu=9000 interface>``, or the FPGAs will not send the packets to your machine. Use e.g. ``ip addr`` on the host to find the MAC address of your interface, and verify that its MTU is 9000,
+  - ``xst.FPGA_xst_offload_hdr_ip_destination_address_R[x] == <IP of your machine's mtu=9000 interface>``, or the packets will be dropped by the network or the kernel of your machine,
+  - ``xst.FPGA_xst_offload_hdr_udp_destination_port_R[x] == 5002``, or the packets will not be sent to a port that the XST device listens on.
+
+If all of these settings are correct and you still receive no packets, see the next question.
+
+*Q: I am still not receiving XSTs and/or SSTs, even though the settings appear correct!*
+
+Let's see where the packets get stuck. Let us assume your MTU=9000 network interface is called ``em2`` (see ``ip addr`` to check):
+
+- Check whether the data arrives on ``em2``. Run ``tcpdump -i em2 udp -nn -vvv -c 10`` to capture the first 10 packets. Verify:
+
+  - The destination MAC matches that of ``em2``,
+  - The destination IP matches that of ``em2``,
+  - The destination port is correct (5001 for SST, 5002 for XST),
+  - The source IP falls within the netmask of ``em2`` (unless ``net.ipv4.conf.em2.rp_filter=0`` is configured),
+  - The TTL is at least 2.
+
+- If you see no data at all, the network has swallowed it. Try a direct network connection, or a hub (which broadcasts all packets, unlike a switch), to see what is being emitted by the FPGAs.
+- Check whether the data reaches user space on the host:
+
+  - Turn off the ``sst`` or ``xst`` device. This will not stop the FPGAs from sending.
+  - Run ``nc -u -l -p 5001 -vv`` (or port 5002 for XSTs). You should see raw packets being printed.
+  - If not, the Linux kernel is swallowing the packets before they can be sent to our docker container.
+
+- Check whether the data reaches kernel space in the container:
+
+  - Enter the docker container by running ``docker exec -it device-sst bash``,
+  - Run ``sudo bash`` to become root,
+  - Run ``apt-get install -y tcpdump`` to install tcpdump,
+  - Check whether packets arrive using ``tcpdump -i eth0 udp -c 10 -nn``,
+  - If not, Linux is not routing the packets to the docker container.
+
+- Check whether the data reaches user space in the container:
+
+  - Turn off the ``sst`` or ``xst`` device. This will not stop the FPGAs from sending,
+  - Enter the docker container by running ``docker exec -it device-sst bash``,
+  - Run ``sudo bash`` to become root,
+  - Run ``apt-get install -y netcat`` to install netcat,
+  - Check whether packets arrive using ``nc -u -l -p 5001 -vv`` (or port 5002 for XSTs),
+  - If not, Linux is not routing the packets to the docker container correctly.
+
+- If still no error was found, you've likely hit a bug in our software.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index cc731b347..524d21369 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -23,6 +23,7 @@ Even without having access to any LOFAR2.0 hardware, you can install the full st
    devices/configure
    configure_station
    developer
+   faq
 
 
 Indices and tables
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index 2cfb177a1..cb0122ae9 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -76,6 +76,8 @@ The following commands start all the software devices to control the station har
 
 See :ref:`boot` for more information on the ``boot`` device.
 
+.. _elk-kernel-settings:
+
 ELK
 ````
 
-- 
GitLab