Describe the Prometheus content, and how to use Grafana

Commit 6746d9e7 authored Jun 1, 2022 by Jan David Mol
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -17,6 +17,9 @@ The Operations Central Management module is setup to monitor the LOFAR telescope
   :caption: Contents:

   installation
+   intro
+   monitoring
+   stack
   prometheus
   grafana


--- a/docs/source/intro.rst
+++ b/docs/source/intro.rst
@@ -3,8 +3,8 @@ Introduction

 The Operations Central Monitoring setup provides you with the following user services:

-* A *Grafana* monitoring & alerting system, exposed on http://localhost:3001,
-* A *Alerta* alarm-management system, exposed on http://localhost:8081.
+* A *Grafana* monitoring & alerting system, exposed on http://localhost:3001 (credentials: admin/admin),
+* An *Alerta* alarm-management system, exposed on http://localhost:8081 (credentials: admin/alerta).

 As well as the following backing services to support the setup:


--- a/docs/source/monitoring.rst
+++ b/docs/source/monitoring.rst
+Monitoring
+===================================
+
+The Grafana system exposed on http://localhost:3001 visualises the monitoring information collected by Prometheus (and other sources). It is organised as follows, with links to the relevant Grafana documentation:
+
+* A series of `dashboards <https://grafana.com/docs/grafana/latest/dashboards/>`_, organised into *folders*. Each dashboard is an independent page of visualisations. When you log in, you will see the configured "Home" dashboard.
+* Each dashboard has a series of `panels <https://grafana.com/docs/grafana/latest/panels/>`_, often organised into collapsible *rows*. Each panel contains a specific visualisation and can have alarms configured on it. The panels are tiled.
+* Each panel has a set of *queries*, which describe the data to be visualised, and a single *visualisation*, which determines how the data is displayed.
+
+The Grafana documentation will help you with using Grafana in general. Also be sure to check out the `webinars and videos <https://grafana.com/videos/>`_ that Grafana provides.
+
+Writing Queries
+------------------------------------
+
+Most of the data will be queried from the *Prometheus* backend:
+
+* Grafana provides a `Prometheus query editor <https://grafana.com/docs/grafana/latest/datasources/prometheus/#prometheus-query-editor>`_ to set up queries interactively.
+* The queries themselves use the `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`_ syntax.
+* Apart from configuring panels, you can also play with queries in the Explore tab (http://localhost:3001/explore), and directly in the Prometheus backend (http://localhost:9091).
+
+The Prometheus database is flat: it contains time series for metrics, each of which carries a name, a set of labels, and a float value::
+
+  attribute_name{label="value", ...} attribute_value
+
+For example::
+
+  device_attribute{host="dop496", station="DTS Outside", device="stat/sdp/1", name="FPGA_temp_R", x="03", y="00"} 42.3
+
+The queries express selections on these entries for a given name, filtered by the given labels. For example, the following query returns all FPGA temperatures across all stations, including the above entry::
+
+  device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}
+
+Furthermore, values of different metrics can be combined (added, merged, etc.). See the PromQL documentation for more details.
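+
+A few further sketches of such selections and combinations, using only the metric and label names shown above (the threshold of 80 is an arbitrary illustration, not a recommended limit)::
+
+  # FPGA temperatures reported by a single host
+  device_attribute{host="dop496", name="FPGA_temp_R"}
+
+  # Highest FPGA temperature per host
+  max by (host) (device_attribute{name="FPGA_temp_R"})
+
+  # Only those entries whose current value exceeds 80
+  device_attribute{name="FPGA_temp_R"} > 80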
+
+Querying LOFAR Station Control
+````````````````````````````````````
+
+The `LOFAR Station Control <https://lofar20-station-control.readthedocs.io/en/latest/>`_ software exposes a series of metrics from each station:
+
+:device_attribute: All monitoring points from Tango that are configured to be exposed to Prometheus. For arrays, each element is its own metric. It carries the following labels:
+
+  :job:       `stations`
+  :host:      Station hostname from which the value was obtained (e.g. `dts-lcu`),
+  :station:   Name of the station, as reported by the station (e.g. `DTS`) (NB: for now, the host is more reliable to use),
+  :device:    Tango device of this attribute (e.g. `stat/recv/1`),
+  :name:      Tango attribute name (e.g. `ANT_mask_RW`),
+  :type:      Data type (e.g. `string`, `float`, `bool`),
+  :x:         Offset in the first dimension, if the attribute is a 1D or 2D array, otherwise "00",
+  :y:         Offset in the second dimension, if the attribute is a 2D array, otherwise "00",
+  :idx:       Global offset in the array, combining `x` and `y`,
+  :str_value: The value of the attribute, if the attribute type is a string.
+
+:device_scraping: Time required to scrape each Tango device, in seconds. It carries the following labels:
+
+  :job:       `stations`
+  :host:      Station hostname from which the value was obtained (e.g. `dts-lcu`),
+  :station:   Name of the station, as reported by the station (e.g. `DTS`) (NB: for now, the host is more reliable to use),
+  :device:    Tango device scraped.
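+
+As a minimal sketch of how these labels can be used in queries, built only from the example label values listed above (whether matching series exist depends on the station)::
+
+  # One attribute across all recv devices of one station, matched by regular expression
+  device_attribute{host="dts-lcu", device=~"stat/recv/.*", name="ANT_mask_RW"}
+
+  # All elements of an array attribute with first-dimension offset "03"
+  device_attribute{device="stat/sdp/1", name="FPGA_temp_R", x="03"}
+
+  # The five Tango devices that currently take longest to scrape
+  topk(5, device_scraping{job="stations"})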
+
+Metrics from the non-Tango services are exposed as well. See the linked documentation, or use the interactive interfaces, to explore them further:
+
+:scrape\_\*: Metrics describing scraping (that is, Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.
+
+  :job:          `stations`
+  :host:         Station hostname from which the value was obtained (e.g. `dts-lcu`),
+  :exported_job: Original job on the station (`host`, `prometheus`, `grafana`).
+
+:node\_\*: Metrics describing the server, see https://github.com/prometheus/node_exporter.
+
+  :job:          `stations`
+  :host:         Station hostname from which the value was obtained (e.g. `dts-lcu`),
+  :exported_job: `host`
+
+:go\_\*, grafana\_\*: Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.
+
+  :job:          `stations`
+  :host:         Station hostname from which the value was obtained (e.g. `dts-lcu`),
+  :exported_job: `grafana`
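+
+These metrics can be selected in the same way. The sketch below assumes that the standard metric names `scrape_duration_seconds` (generated by Prometheus) and `node_load1` (from the node exporter on Linux) are among those collected::
+
+  # Duration of the last scrape of each job on one station
+  scrape_duration_seconds{job="stations", host="dts-lcu"}
+
+  # 1-minute load average of every station host
+  node_load1{job="stations", exported_job="host"}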
+
+Querying Operations Central Management
+````````````````````````````````````````
+
+This software stack itself also exposes metrics from its various services:
+
+
+:scrape\_\*: Metrics describing scraping (that is, Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.
+
+  :job:          `prometheus`
+
+:node\_\*: Metrics describing the server, see https://github.com/prometheus/node_exporter.
+
+  :job:          `host`
+
+:go\_\*, grafana\_\*: Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.
+
+  :job:          `grafana`
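+
+Since the `job` label differs, the same metric name can be selected for the central stack instead of the stations. A short sketch, assuming `node_load1` and `go_goroutines` are among the collected metrics::
+
+  # 1-minute load average of the central server only
+  node_load1{job="host"}
+
+  # Number of goroutines in the central Grafana instance
+  go_goroutines{job="grafana"}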