Commit 451747d4 authored by Jan David Mol
Merged prometheus and grafana pages into the developer manual, renamed monitoring to user manual

Software Stack Developer Manual
===========================================

The following sections describe how the software stack is set up, how it can be configured, and how to interact with it at a lower level.
Prometheus is our *time-series database*, and fulfills several roles:

* Scraping (collecting) metrics periodically from across our instrument,
* Running queries against the time-series database.
Configuration
"""""""""""""""""""""""""""""""""""""""""""
The ``prometheus-central/prometheus.yml`` configuration file configures our instance to:

* Periodically scrape the Prometheus metrics from the Prometheus installations on each station, using a `federation <https://prometheus.io/docs/prometheus/latest/federation/#federation>`_,
* Periodically scrape the metrics from the services that are part of this software package,
* Periodically scrape any other metric source that is offered in the Prometheus format.
The scraped metrics are annotated with a ``host`` label denoting where the metric came from. This replaces the ``host=localhost`` label coming from station metrics. The full configuration is as follows:
.. literalinclude:: ../../prometheus-central/prometheus.yml
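For illustration, a station federation job in this file could look roughly like the sketch below. This is a sketch only: the job name, station host name, and interval are invented here; the authoritative version is the included file above.

.. code-block:: yaml

   # Sketch of a federation scrape job; host names and values are illustrative.
   scrape_configs:
     - job_name: stations
       scrape_interval: 60s              # example interval
       metrics_path: /federate           # the federation end point of the station's Prometheus
       honor_labels: false               # conflicting station labels are renamed to exported_*
       params:
         'match[]':
           - '{job!=""}'                 # pull all series the station has scraped
       static_configs:
         - targets: ['cs001c.example.org:9090']   # hypothetical station host
           labels:
             host: cs001c                # annotates every scraped metric with its origin

With ``honor_labels: false`` (the default), labels from the station that clash with labels set here are kept under an ``exported_`` prefix, which is why the station's ``job`` label shows up as ``exported_job`` below.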
Furthermore, ``prometheus-central/Dockerfile`` configures:

* The `retention` of the archive, or for how long/how much data will be stored. See also https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects,
* Where the data are stored (in conjunction with the paths mounted in ``prometheus-central.yml``).
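As an illustration of what such settings look like, Prometheus accepts the retention and storage path as command-line flags. The sketch below uses docker-compose style and invented values; the actual flags live in ``prometheus-central/Dockerfile`` and may differ.

.. code-block:: yaml

   # Sketch only: illustrative flag values, not this stack's actual configuration.
   services:
     prometheus-central:
       command:
         - '--config.file=/etc/prometheus/prometheus.yml'
         - '--storage.tsdb.path=/prometheus'        # where the data are stored
         - '--storage.tsdb.retention.time=1y'       # example: keep one year of data
         - '--storage.tsdb.retention.size=100GB'    # example: cap total disk usage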
Monitoring
"""""""""""""""""""""""""""""""""""""""""""
Prometheus also collects metrics about itself, e.g. the performance of the configured scrape jobs::

  # list how many samples could be scraped from each end point
  scrape_samples_scraped{exported_job=""}

  # list a report of the scraping duration for each configured end point
  scrape_duration_seconds{exported_job=""}
NB: The ``exported_job=""`` filter is needed to avoid returning the values from Prometheus instances on the stations about their scraping.
Grafana
-------------------------------------------
The alerts must be configured to be forwarded to Alerta as follows:
* Add a Contact point ``Alerta`` with the following settings:

  + ``Contact point type`` is ``Webhook``,
  + ``Url`` is ``http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key``. This API key is configured as ``ADMIN_KEY`` in ``alerta.yml``.
.. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a Test alert on the Contact point page.
* In Notification policies, modify the *Root policy* to:

  + ``Default contact point`` is ``Alerta``,
  + Under ``Timing options``, set ``Group wait`` = 10s, ``Group interval`` = 10s, ``Repeat interval`` = 10m.
The shorter Group times lower the latency with which alerts are sent, and the shorter Repeat interval means any lost or deleted alarms are resent earlier (instead of after the default 4 hours).
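For reference, recent Grafana versions can also provision this contact point and root policy from a YAML file instead of through the UI. The sketch below is illustrative only (the ``uid`` is arbitrary and field support depends on the Grafana version); this stack configures alerting through the UI as described above.

.. code-block:: yaml

   # Illustrative provisioning sketch; not how this stack is configured.
   apiVersion: 1

   contactPoints:
     - orgId: 1
       name: Alerta
       receivers:
         - uid: alerta-webhook           # arbitrary identifier
           type: webhook
           settings:
             url: http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key

   policies:
     - orgId: 1
       receiver: Alerta                  # default contact point for the root policy
       group_wait: 10s
       group_interval: 10s
       repeat_interval: 10m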
Monitoring
"""""""""""""""""""""""""""""""""""""""""""
The Alerta server receives the alerts from Grafana, and routes them through several plugins:
* Our ``lofar-plugin`` processes LOFAR-specific properties, such as pulling metadata from the device attributes we use in Prometheus,
* The ``slack`` plugin posts messages on our Slack instance.

For info on externally developed plugins, see also the `alerta-contrib <https://github.com/alerta/alerta-contrib>`_ repo.
Slack plugin
"""""""""""""""""""""""""""""""""""""""""""
Grafana: Monitoring Dashboards
------------------------------------------
We use `Grafana <https://grafana.com/docs/grafana/latest/introduction/>`_ to visualise the monitoring information through a series of *dashboards*. It allows us to:
* Interactively create sets of plots (*panels*) of monitoring points, visualised in various ways (including instrument diagrams),
* Have access to a wide variety of data sources,
* Add *alerts* to trigger on monitoring point formulas reaching a certain threshold.
Configuration
`````````````````````````````````
Grafana comes with preinstalled datasources and dashboards, provided in the ``grafana-central/`` directory. By default, the following datasources are configured:
* *Prometheus* (default), providing almost all monitoring metrics,
* *Alerta UI*, providing state from the Alerta Alertmanager (see the `Alerta ReST API <https://docs.alerta.io/api/reference.html>`_),
* *Grafana API*, providing access to Grafana's API (see e.g. the `Grafana Alerting ReST API <https://editor.swagger.io/?url=https://raw.githubusercontent.com/grafana/grafana/main/pkg/services/ngalert/api/tooling/post.json>`_).
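As an illustration, a provisioned Prometheus datasource typically looks like the sketch below; the actual provisioning files ship in ``grafana-central/``, and the container name and port used here are assumptions.

.. code-block:: yaml

   # Sketch of a Grafana datasource provisioning file; name and URL are assumptions.
   apiVersion: 1

   datasources:
     - name: Prometheus
       type: prometheus
       access: proxy                           # Grafana proxies the queries to Prometheus
       url: http://prometheus-central:9090     # assumed container name and port
       isDefault: true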
Using Grafana
`````````````````````````````````
Go to http://localhost:3001 to access the Grafana instance. The default guest access allows looking at dashboards and inspecting the data in the datasources manually. To create or edit dashboards, or change settings, you need to Sign In. The default credentials are ``admin/admin``.
Adding alerts
`````````````````````````````````
We use the `Grafana 8+ alerts <https://grafana.com/docs/grafana/latest/alerting/>`_ to monitor our system. You can add alerts to panels, or add free-floating ones under the ``(alarm bell) -> Alert rules`` menu, which is also used to browse the state of the existing alerts. Some tips:
* Select the *Alert groups* tab to filter alerts or apply custom grouping, for example, by station or by component.
Forwarding alerts to Alerta
`````````````````````````````````
The alerts in Grafana come and go, without leaving a record of ever having been there. To keep track of alerts, we forward them to our Alerta instance. This forwarding has to be configured manually:
- Go to Grafana (http://localhost:3001) and sign in with an administration account (default: ``admin/admin``),
- In the left menubar, go to ``(alarm bell) -> Admin``, paste the following configuration, and press ``Save``:
.. literalinclude:: ../../grafana-central/alerting.json
.. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a `test alert <http://localhost:3001/alerting/notifications/receivers/Alerta/edit?alertmanager=grafana>`_.
The Operations Central Management module is set up to monitor the LOFAR telescope.
installation
intro
user_manual
dev_manual
Introduction
=====================================

The Operations Central Monitoring setup collects monitoring information from across the instrument, and provides monitoring dashboards as well as an alarm-management system on top. It provides you with the following user services:
* A *Grafana* monitoring & alerting system, exposed on http://localhost:3001 (credentials: admin/admin),
* An *Alerta* alarm-management system, exposed on http://localhost:8081 (credentials: admin/alerta).
As well as the following backing services to support the setup:

* A *Prometheus* database that collects monitoring information from across the instrument, exposed on http://localhost:9091,
* A *Node Exporter* scraper that collects monitoring information of the host running this software stack, exposed on http://localhost:9100.

.. hint:: The URLs assume you're running this software on localhost. Replace this with the hostname of the hosting system if you're accessing this software on a server.
Prometheus: Aggregating Monitoring Data
------------------------------------------
We use `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to *scrape* monitoring data ("metrics") from across the telescope, and collect it into a single time-series database. Our Prometheus instance is running as the ``prometheus-central`` docker container, which periodically (every 10-60s) obtains metrics from the configured end points. This setup has several advantages:
* Easy to set up. The end points only have to provide a plain text HTTP interface that serves the current state of their metrics.
* Robustness. If scraping somehow errors or times out, Prometheus will simply retry in the next scraping period.
* Widespread support. Many open-source packages already provide a Prometheus metrics end point out of the box.
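For instance, such an end point simply returns plain text in the Prometheus exposition format. A hypothetical response (the metric names and values below are made up) looks like::

  # HELP node_load1 1m load average.
  # TYPE node_load1 gauge
  node_load1 0.81
  # HELP device_temperature_celsius Temperature reported by a (hypothetical) device.
  # TYPE device_temperature_celsius gauge
  device_temperature_celsius{device="rcu01"} 42.5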
Configuration
`````````````````````````````````
The scraping configuration is provided in ``prometheus-central/prometheus.yml``:
.. literalinclude:: ../../prometheus-central/prometheus.yml
The following end points are scraped:
* Stations. Metrics are obtained from their local Prometheus instance using a `Prometheus Federation <https://prometheus.io/docs/prometheus/latest/federation/>`_ setup.
* Local services. Metrics from fellow Operations Central Management containers are scraped as well.
* Local machine. Metrics from the machine running our containers are scraped as well (provided by the ``prometheus-node-exporter`` container).
Inspection in Prometheus
`````````````````````````````````
The Prometheus server provides a direct interface on http://localhost:9091 to query the database. PromQL allows you to specify which metric(s) to view, combine, filter, scale, etc. Some general statistics about the scraping process are provided by the following queries::

  # list how many samples could be scraped from each end point
  scrape_samples_scraped

  # list a report of the scraping duration for each configured end point
  scrape_duration_seconds
NB: The timestamp(s) for which the data are requested are specified separately from the query itself; in this interface, via a time picker defaulting to "now".
Metrics and queries
`````````````````````````````````
Prometheus stores each value as an independent metric, identified by a series *name*, string key-value *labels*, and a float or integer value, for example (see also the `Prometheus Data Model <https://prometheus.io/docs/concepts/data_model/>`_)::

  scrape_samples_scraped{host="dop496", instance="dop496.astron.nl:9090", job="stations"} 551
Selecting metrics can be done using the `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`_ query language. For example, ``scrape_samples_scraped{job="stations"}`` would return only the scrape statistics for the metrics in the ``stations`` scrape job.
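PromQL also allows combining and aggregating such selections. For example, the following (illustrative) query sums the number of scraped samples per station host::

  # total number of samples scraped, summed per station host
  sum by (host) (scrape_samples_scraped{job="stations"})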
User Manual
===================================

The Grafana system exposed on http://localhost:3001 allows visualisation of the monitoring information collected by Prometheus (and other sources). It contains, with links to the relevant Grafana documentation: