Commit 451747d4 authored by Jan David Mol
Merged prometheus and grafana pages into the developer manual, renamed monitoring to user manual

Software Stack Developer Manual
===========================================

The following sections describe how the software stack is set up, how it can be configured, and how to interact with it at a lower level.
Prometheus is our *time-series database*, and fulfills several roles:

* Scraping (collecting) metrics periodically from across our instrument,
* Running queries against the time-series database.
Configuration
"""""""""""""""""""""""""""""""""""""""""""
The ``prometheus-central/prometheus.yml`` configuration file configures our instance to:

* Periodically scrape the Prometheus metrics from the Prometheus installations on each station, using a `federation <https://prometheus.io/docs/prometheus/latest/federation/#federation>`_,
* Periodically scrape the metrics from the services that are part of this software package,
* Periodically scrape any other metric source that is offered in the Prometheus format.
The scraped metrics are annotated with a ``host`` label denoting where the metric came from. This replaces the ``host=localhost`` label coming from station metrics. The full configuration is as follows:
.. literalinclude:: ../../prometheus-central/prometheus.yml
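For illustration, a station federation job in this file could look roughly like the sketch below. This is a sketch only: the job name, station host name, and interval are invented here; the authoritative version is the included file above.

.. code-block:: yaml

   # Sketch of a federation scrape job; host names and values are illustrative.
   scrape_configs:
     - job_name: stations
       scrape_interval: 60s              # example interval
       metrics_path: /federate           # the federation end point of the station's Prometheus
       honor_labels: false               # conflicting station labels are renamed to exported_*
       params:
         'match[]':
           - '{job!=""}'                 # pull all series the station has scraped
       static_configs:
         - targets: ['cs001c.example.org:9090']   # hypothetical station host
           labels:
             host: cs001c                # annotates every scraped metric with its origin

With ``honor_labels: false`` (the default), labels from the station that clash with labels set here are kept under an ``exported_`` prefix, which is why the station's ``job`` label shows up as ``exported_job`` below.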
Furthermore, ``prometheus-central/Dockerfile`` configures:

* The `retention` of the archive, or for how long/how much data will be stored. See also https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects,
* Where the data are stored (in conjunction with the paths mounted in ``prometheus-central.yml``).
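As an illustration of what such settings look like, Prometheus accepts the retention and storage path as command-line flags. The sketch below uses docker-compose style and invented values; the actual flags live in ``prometheus-central/Dockerfile`` and may differ.

.. code-block:: yaml

   # Sketch only: illustrative flag values, not this stack's actual configuration.
   services:
     prometheus-central:
       command:
         - '--config.file=/etc/prometheus/prometheus.yml'
         - '--storage.tsdb.path=/prometheus'        # where the data are stored
         - '--storage.tsdb.retention.time=1y'       # example: keep one year of data
         - '--storage.tsdb.retention.size=100GB'    # example: cap total disk usage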
Monitoring
"""""""""""""""""""""""""""""""""""""""""""
Prometheus also collects metrics about itself, e.g. the performance of the configured scrape jobs::

  # list how many samples could be scraped from each end point
  scrape_samples_scraped{exported_job=""}

  # list a report of the scraping duration for each configured end point
  scrape_duration_seconds{exported_job=""}
NB: The ``exported_job=""`` filter is needed to avoid returning the values from Prometheus instances on the stations about their scraping.
Grafana
-------------------------------------------
The alerts must be configured to be forwarded to Alerta as follows:
* Add a Contact point ``Alerta`` with the following settings:

  + ``Contact point type`` is ``Webhook``,
  + ``Url`` is ``http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key``. This API key is configured as ``ADMIN_KEY`` in ``alerta.yml``.
.. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a Test alert on the Contact point page.
* In Notification policies, modify the *Root policy* to:

  + ``Default contact point`` is ``Alerta``,
  + Under ``Timing options``, set ``Group wait`` = 10s, ``Group interval`` = 10s, ``Repeat interval`` = 10m.
The shorter Group times lower the latency with which alerts are sent, and the shorter Repeat interval means any lost or deleted alarms are resent earlier (instead of after the default 4 hours).
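For reference, recent Grafana versions can also provision this contact point and root policy from a YAML file instead of through the UI. The sketch below is illustrative only (the ``uid`` is arbitrary and field support depends on the Grafana version); this stack configures alerting through the UI as described above.

.. code-block:: yaml

   # Illustrative provisioning sketch; not how this stack is configured.
   apiVersion: 1

   contactPoints:
     - orgId: 1
       name: Alerta
       receivers:
         - uid: alerta-webhook           # arbitrary identifier
           type: webhook
           settings:
             url: http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key

   policies:
     - orgId: 1
       receiver: Alerta                  # default contact point for the root policy
       group_wait: 10s
       group_interval: 10s
       repeat_interval: 10m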
Monitoring
"""""""""""""""""""""""""""""""""""""""""""
The Alerta server receives the alerts from Grafana, and routes them through several plugins:
* Our ``lofar-plugin`` processes LOFAR-specific properties, such as pulling metadata from the device attributes we use in Prometheus,
* The ``slack`` plugin posts messages on our Slack instance.

For info on externally developed plugins, see also the `alerta-contrib <https://github.com/alerta/alerta-contrib>`_ repo.
Slack plugin
"""""""""""""""""""""""""""""""""""""""""""
Grafana: Monitoring Dashboards
------------------------------------------
We use `Grafana <https://grafana.com/docs/grafana/latest/introduction/>`_ to visualise the monitoring information through a series of *dashboards*. It allows us to:
* Interactively create sets of plots (*panels*) of monitoring points, visualised in various ways (including instrument diagrams),
* Have access to a wide variety of data sources,
* Add *alerts* to trigger on monitoring point formulas reaching a certain threshold.
Configuration
`````````````````````````````````
Grafana comes with preinstalled datasources and dashboards, provided in the ``grafana-central/`` directory. By default, the following datasources are configured:
* *Prometheus* (default), providing almost all monitoring metrics,
* *Alerta UI*, providing state from the Alerta Alertmanager (see the `Alerta ReST API <https://docs.alerta.io/api/reference.html>`_),
* *Grafana API*, providing access to Grafana's API (see e.g. the `Grafana Alerting ReST API <https://editor.swagger.io/?url=https://raw.githubusercontent.com/grafana/grafana/main/pkg/services/ngalert/api/tooling/post.json>`_).
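As an illustration, a provisioned Prometheus datasource typically looks like the sketch below; the actual provisioning files ship in ``grafana-central/``, and the container name and port used here are assumptions.

.. code-block:: yaml

   # Sketch of a Grafana datasource provisioning file; name and URL are assumptions.
   apiVersion: 1

   datasources:
     - name: Prometheus
       type: prometheus
       access: proxy                           # Grafana proxies the queries to Prometheus
       url: http://prometheus-central:9090     # assumed container name and port
       isDefault: true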
Using Grafana
`````````````````````````````````
Go to http://localhost:3001 to access the Grafana instance. The default guest access allows looking at dashboards and inspecting the data in the datasources manually. To create or edit dashboards, or change settings, you need to Sign In. The default credentials are ``admin/admin``.
Adding alerts
`````````````````````````````````
We use the `Grafana 8+ alerts <https://grafana.com/docs/grafana/latest/alerting/>`_ to monitor our system. You can add alerts to panels, or add free-floating ones under the ``(alarm bell) -> Alert rules`` menu, which is also used to browse the state of the existing alerts. Some tips:
* Select the *Alert groups* tab to filter alerts or apply custom grouping, for example, by station or by component.
Forwarding alerts to Alerta
`````````````````````````````````
The alerts in Grafana come and go, without leaving a record of ever having been there. To keep track of alerts, we forward them to our Alerta instance. This forwarding has to be configured manually:
- Go to Grafana (http://localhost:3001) and sign in with an administration account (default: ``admin/admin``),
- In the left menubar, go to ``(alarm bell) -> Admin``, paste the following configuration, and press ``Save``:
.. literalinclude:: ../../grafana-central/alerting.json
.. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a `test alert <http://localhost:3001/alerting/notifications/receivers/Alerta/edit?alertmanager=grafana>`_.
The Operations Central Management module is set up to monitor the LOFAR telescope.
installation
intro
user_manual
dev_manual
Introduction
=====================================

The Operations Central Monitoring setup collects monitoring information from across the instrument, and provides monitoring dashboards as well as an alarm-management system on top. It provides you with the following user services:
* A *Grafana* monitoring & alerting system, exposed on http://localhost:3001 (credentials: admin/admin),
* An *Alerta* alarm-management system, exposed on http://localhost:8081 (credentials: admin/alerta).
As well as the following backing services to support the setup:

* A *Prometheus* database that collects monitoring information from across the instrument, exposed on http://localhost:9091,
* A *Node Exporter* scraper that collects monitoring information of the host running this software stack, exposed on http://localhost:9100.

.. hint:: The URLs assume you're running this software on localhost. Replace this with the hostname of the hosting system if you're accessing this software on a server.
Prometheus: Aggregating Monitoring Data
------------------------------------------
We use `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to *scrape* monitoring data ("metrics") from across the telescope, and collect it into a single time-series database. Our Prometheus instance is running as the ``prometheus-central`` docker container, which periodically (every 10-60s) obtains metrics from the configured end points. This setup has several advantages:
* Easy to set up. The end points only have to provide a plain text HTTP interface that serves the current state of their metrics.
* Robustness. If scraping somehow errors or times out, Prometheus will simply retry in the next scraping period.
* Widespread support. Many open-source packages already provide a Prometheus metrics end point out of the box.
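For instance, such an end point simply returns plain text in the Prometheus exposition format. A hypothetical response (the metric names and values below are made up) looks like::

  # HELP node_load1 1m load average.
  # TYPE node_load1 gauge
  node_load1 0.81
  # HELP device_temperature_celsius Temperature reported by a (hypothetical) device.
  # TYPE device_temperature_celsius gauge
  device_temperature_celsius{device="rcu01"} 42.5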
Configuration
`````````````````````````````````
The scraping configuration is provided in ``prometheus-central/prometheus.yml``:
.. literalinclude:: ../../prometheus-central/prometheus.yml
The following end points are scraped:
* Stations. Metrics are obtained from their local Prometheus instance using a `Prometheus Federation <https://prometheus.io/docs/prometheus/latest/federation/>`_ setup.
* Local services. Metrics from fellow Operations Central Management containers are scraped as well.
* Local machine. Metrics from the machine running our containers are scraped as well (provided by the ``prometheus-node-exporter`` container).
Inspection in Prometheus
`````````````````````````````````
The Prometheus server provides a direct interface on http://localhost:9091 to query the database. PromQL allows you to specify which metric(s) to view, combine, filter, scale, etc. Some general statistics about the scraping process are provided by the following queries::

  # list how many samples could be scraped from each end point
  scrape_samples_scraped

  # list a report of the scraping duration for each configured end point
  scrape_duration_seconds
NB: The timestamp(s) for which the data are requested are specified separately from the query itself; in this interface, via a time picker defaulting to "now".
Metrics and queries
`````````````````````````````````
Prometheus stores each value as an independent metric, identified by a series *name*, string key-value *labels*, and a float or integer value, for example (see also the `Prometheus Data Model <https://prometheus.io/docs/concepts/data_model/>`_)::

  scrape_samples_scraped{host="dop496", instance="dop496.astron.nl:9090", job="stations"} 551
Selecting metrics can be done using the `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`_ query language. For example, ``scrape_samples_scraped{job="stations"}`` would return only the scrape statistics for the metrics in the ``stations`` scrape job.
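PromQL also allows combining and aggregating such selections. For example, the following (illustrative) query sums the number of scraped samples per station host::

  # total number of samples scraped, summed per station host
  sum by (host) (scrape_samples_scraped{job="stations"})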
User Manual
===================================

The Grafana system exposed on http://localhost:3001 allows visualisation of the monitoring information collected by Prometheus (and other sources). It contains, with links to the relevant Grafana documentation: