[[_TOC_]]
# HOWTO
## Login to the client VM
Almost all docker containers of the software stack run within the client VM,
which is actually run as a docker container in the development environment.
This container can be identified programmatically as follows. Use the additional
``-q`` parameter to obtain just the container ID:
```
$ docker ps --filter 'name=client.station.nomad.nomad-cluster.jumppad.dev'
CONTAINER ID   IMAGE                     COMMAND                  CREATED         STATUS         PORTS   NAMES
90f6f253fb58   shipyardrun/nomad:1.6.1   "/usr/bin/supervisor…"   3 minutes ago   Up 3 minutes           fee02e87.client.station.nomad.nomad-cluster.jumppad.dev
$
```
You can login interactively to this container using ``sh``:
```
$ CLIENT_CONTAINER_ID=$(docker ps -q --filter 'name=client.station.nomad.nomad-cluster.jumppad.dev')
$ docker exec -it ${CLIENT_CONTAINER_ID} sh
#
```
## Attach to a running client process
The bulk of our client processes run within the client "VM", which has docker running as well:
```
$ docker exec "${CLIENT_CONTAINER_ID}" docker ps -a
CONTAINER ID   IMAGE                                              COMMAND                  CREATED              STATUS              PORTS   NAMES
00ba94025605   git.astron.nl:5000/lofar2.0/tango/grafana:latest   "/run-wrapper.sh"        12 seconds ago       Up 11 seconds               grafana-de5690ec-219f-51d5-2946-f2f5133e9612
c30d367c387f   git.astron.nl:5000/lofar2.0/tango/loki:latest      "/usr/bin/loki -conf…"   About a minute ago   Up About a minute           loki-e9073864-c281-7390-1b03-8c10fa072e73
7d1d7990031c   envoyproxy/envoy:v1.26.4                           "/docker-entrypoint.…"   About a minute ago   Up About a minute           connect-proxy-loki-e9073864-c281-7390-1b03-8c10fa072e73
3f02415bd6f6   gcr.io/google_containers/pause-amd64:3.1           "/pause"                 About a minute ago   Up About a minute           nomad_init_e9073864-c281-7390-1b03-8c10fa072e73
02a644ecf279   git.astron.nl:5000/lofar2.0/tango/postgres:15.4    "docker-entrypoint.s…"   About a minute ago   Up About a minute           postgres-4d957b81-0de1-0bc0-425b-743b28ba6a8e
[...]
```
You can interact with these containers by logging into the client VM, or directly by chaining docker calls:
```
$ docker exec "${CLIENT_CONTAINER_ID}" docker logs grafana-de5690ec-219f-51d5-2946-f2f5133e9612
Wait until grafana is ready...
logger=settings t=2024-02-13T13:14:47.766739611Z level=info msg="Starting Grafana" version=10.3.1 commit=00a22ff8b28550d593ec369ba3da1b25780f0a4a branch=HEAD compiled=2024-01-22T18:40:42Z
logger=settings t=2024-02-13T13:14:47.767369632Z level=warn msg="ngalert feature flag is deprecated: use unified alerting enabled setting instead"
logger=settings t=2024-02-13T13:14:47.767679942Z level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
logger=settings t=2024-02-13T13:14:47.767713444Z level=info msg="Config loaded from" file=/etc/grafana/grafana.ini
[...]
```
This allows you to use the regular docker commands like ``attach``, ``logs``, and ``restart``. Note that interactive use requires ``-it`` on the outer command as well: ``docker exec -it "${CLIENT_CONTAINER_ID}" ...``.
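For example, to restart or open a shell in the grafana container from the listing above (the container name is taken from that listing and will differ on your system; the shell example assumes the image ships ``sh``):
```
$ docker exec "${CLIENT_CONTAINER_ID}" docker restart grafana-de5690ec-219f-51d5-2946-f2f5133e9612
$ docker exec -it "${CLIENT_CONTAINER_ID}" docker exec -it grafana-de5690ec-219f-51d5-2946-f2f5133e9612 sh
```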
## Patching a device server live
Sometimes it is handy to modify the tangostationcontrol source code of a running device server. To do so (a condensed command sketch follows the list):
1. Log into the client VM using ``docker exec -it "${CLIENT_CONTAINER_ID}" bash``
2. Find the docker container of the device server (e.g. stationmanager), using ``docker ps -a | grep stationmanager``
3. Enter the device server container with ``docker exec -it <container> bash``
4. Install an editor, e.g. ``sudo apt-get install -y vim``
5. Edit the relevant source file in ``/usr/local/lib/python3.10/dist-packages/tangostationcontrol`` (for Python 3.10)
6. Call the ``restart_device_server()`` command for any device in the changed device server
7. Once restarted, call the ``boot()`` command for all devices in the changed device server to reconfigure them
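The same steps, condensed into a command sketch (the ``stationmanager`` container and the file to patch are illustrative placeholders; substitute the device server you are patching):
```
# on the host: enter the client VM
docker exec -it "${CLIENT_CONTAINER_ID}" bash
# inside the client VM: locate and enter the device server container
docker ps -a | grep stationmanager
docker exec -it <stationmanager-container> bash
# inside the device server container: install an editor and patch the source
sudo apt-get install -y vim
vim /usr/local/lib/python3.10/dist-packages/tangostationcontrol/<file-to-patch>.py
```
Afterwards, call ``restart_device_server()`` and ``boot()`` as described in steps 6 and 7.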
## Login to the server
The nomad and consul management processes run on the server, which
is actually a docker container in the development environment.
This container can be identified programmatically as follows. Use the additional
``-q`` parameter to obtain just the container ID:
```
$ docker ps --filter 'name=server.station.nomad.nomad-cluster.jumppad.dev'
CONTAINER ID   IMAGE                     COMMAND                  CREATED         STATUS         PORTS   NAMES
b75f633c837e   shipyardrun/nomad:1.6.1   "/usr/bin/supervisor…"   2 minutes ago   Up 2 minutes   [...]   server.station.nomad.nomad-cluster.jumppad.dev
$
```
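As with the client VM, you can log in to the server interactively (a sketch mirroring the client example above):
```
$ SERVER_CONTAINER_ID=$(docker ps -q --filter 'name=server.station.nomad.nomad-cluster.jumppad.dev')
$ docker exec -it ${SERVER_CONTAINER_ID} sh
#
```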
## Using nomad: Manage jobs on the client
The server allows you to manage the jobs on the client through Nomad. Each *job* consists of one or more *tasks* that are collectively managed; the tasks are (typically) the docker containers. Each running instance of a job on a client is called an *allocation*.
The nomad server exposes its web UI on http://localhost:4646, allowing interactive browsing and control. There is also a CLI, accessed through
```
$ SERVER_CONTAINER_ID=$(docker ps -q --filter 'name=server.station.nomad.nomad-cluster.jumppad.dev')
$ docker exec "${SERVER_CONTAINER_ID}" nomad
Usage: nomad [-version] [-help] [-autocomplete-(un)install] <command> [args]
[...]
```
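The CLI is a wrapper around Nomad's HTTP API, which is served on the same forwarded port 4646. Plain HTTP requests therefore work too, which can be convenient for scripting (a minimal sketch; ``/v1/jobs`` lists all jobs, output omitted):
```
$ curl -s http://localhost:4646/v1/jobs
```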
To list the status of the jobs, use ``nomad status``:
```
$ docker exec "${SERVER_CONTAINER_ID}" nomad status
ID          Type     Priority  Status   Submit Date
connector   service  50        running  2024-02-13T13:12:02Z
monitoring  service  50        dead     2024-02-13T13:12:09Z
```
To get more info about a job, use ``nomad status <job>``:
```
$ docker exec "${SERVER_CONTAINER_ID}" nomad status monitoring
ID            = monitoring
Name          = monitoring
Submit Date   = 2024-02-13T13:12:09Z
Type          = service
Priority      = 50
Datacenters   = stat
Namespace     = default
Node Pool     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
grafana     0       0         0        0       1         0     0
loki        0       0         0        1       1         0     0
postgres    0       0         0        0       1         0     0
prometheus  0       0         0        2       0         0     0

Allocations
No allocations placed
```
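For jobs that do have allocations, individual allocations can be inspected with ``nomad alloc status``, and the logs of their tasks retrieved with ``nomad alloc logs``. The allocation ID comes from the job status output; the ID and task name below are placeholders:
```
$ docker exec "${SERVER_CONTAINER_ID}" nomad alloc status <alloc-id>
$ docker exec "${SERVER_CONTAINER_ID}" nomad alloc logs <alloc-id> <task>
```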
To restart a job, use ``nomad job restart <job>``:
```
$ docker exec -t ${SERVER_CONTAINER_ID} nomad job restart connector
==> 2024-02-13T19:32:02Z: Restarting 1 allocation
2024-02-13T19:32:02Z: Restarting running tasks in allocation "e80cdf6f" for group "connector"
==> 2024-02-13T19:32:03Z: Job restart finished
Job restarted successfully!
$ docker exec -t ${SERVER_CONTAINER_ID} nomad job restart monitoring
No allocations to restart
```
The monitoring job cannot be restarted as it is not running.
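To bring a dead job back, its job specification has to be resubmitted with ``nomad job run``. Where the job file lives depends on your deployment and whether it is accessible inside the server container, so the path below is a placeholder:
```
$ docker exec -t "${SERVER_CONTAINER_ID}" nomad job run <path-to-jobfile>
```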
## Clean up lingering resources
To clean up jumppad lingering resources:
```
# tear down the running configuration
.bin/jumppad down
# remove lingering configuration
rm -rf ~/.jumppad
```
A deeper clean might require:
```
# remove the downloaded jumppad binary
rm .bin/jumppad
# force-remove all docker containers
docker ps -a -q | xargs docker rm -f
# clear unused docker data (stopped containers, dangling images, build cache)
docker system prune
# remove unused volumes
docker volume prune
# remove unused networks
docker network prune
```
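To check that the cleanup worked, the following listings should come back (nearly) empty:
```
# list any remaining containers
docker ps -a
# list any remaining volumes and networks
docker volume ls
docker network ls
```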
# FAQ
## a network with the label id: module.nomad.resource.network.station, was not found
Solution: jumppad is confused by a lingering configuration. Clean up any lingering
resources (see [above](#clean-up-lingering-resources)).
## unable to destroy resource Name: station, Type: network
Solution: some containers are still running. Stop them, or force stop all containers using ``docker ps -q -a | xargs docker rm -f``.