- The Tango Controls `hdbpp subsystem <https://tango-controls.readthedocs.io/en/latest/administration/services/hdbpp/hdb++-design-guidelines.html>`_ archives data-value changes into a `TimescaleDB <http://timescale.com>`_ database,
- Grafana allows `Alert rules <https://grafana.com/docs/grafana/latest/alerting/>`_ to be configured, which poll TimescaleDB and generate an *alert* when the configured condition is met. It also maintains a list of currently firing alerts,
- `Alerta <https://alerta.io/>`_ is the *alert manager*: it receives these alerts, manages duplicates, and maintains alerts until the operator explicitly acknowledges them. It thus also has a list of alerts that fired in the past.
Archiving attributes
```````````````````````
The attributes of interest have to be *archived* periodically in order to be visible in Grafana, and thus to be able to define alerts for them. In Tango Controls, a *configuration manager* provides an interface to manage what is archived, and one or more *event subscribers* subscribe to change events and forward them to the archive database.
The ``tangoncontrols.toolkit.archiver.Archiver`` class provides an easy interface to the archiver. It uses the ``device/attribute`` notation for attributes, e.g. ``STAT/SDP/1/FPGA_error_R``. Some of the functions it provides:
:add_attribute_to_archiver(attribute, polling_period, event_period): Register the given attribute, to be polled every ``polling_period`` ms. The attribute is also archived on changes, at a maximum rate of once every ``event_period`` ms.
:remove_attribute_from_archiver(attribute): Unregister the given attribute.
:start_archiving_attribute(attribute): Start archiving the given attribute.
:stop_archiving_attribute(attribute): Stop archiving the given attribute.
:get_attribute_errors(attribute): Return any errors detected while trying to archive the attribute.
:get_subscriber_errors(): Return any errors detected by the subscribers.
So a useful idiom to archive an individual attribute is to register it with the archiver and then start archiving it.
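A minimal sketch of such an idiom, using only the method names listed above. The helper name ``archive_attribute``, the 1000 ms periods, and the ``RecordingArchiver`` stand-in (including its empty error list) are hypothetical and only serve to make the example self-contained; on a live system an ``Archiver`` instance would presumably be passed in instead.

```python
class RecordingArchiver:
    """Hypothetical stand-in that mimics the Archiver method names listed above."""

    def __init__(self):
        self.calls = []

    def add_attribute_to_archiver(self, attribute, polling_period, event_period):
        self.calls.append(("add", attribute, polling_period, event_period))

    def start_archiving_attribute(self, attribute):
        self.calls.append(("start", attribute))

    def get_attribute_errors(self, attribute):
        # The real return type is not documented here; an empty list is assumed.
        return []


def archive_attribute(archiver, attribute, polling_period=1000, event_period=1000):
    """Register an attribute, start archiving it, and return any reported errors."""
    archiver.add_attribute_to_archiver(attribute, polling_period, event_period)
    archiver.start_archiving_attribute(attribute)
    return archiver.get_attribute_errors(attribute)


archiver = RecordingArchiver()
errors = archive_attribute(archiver, "STAT/SDP/1/FPGA_error_R")
```

Checking ``get_attribute_errors`` right after starting gives early feedback when the attribute name is misspelled or the device is unreachable.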
.. note:: The archive subscriber gets confused if attributes it archives disappear from the monitoring database. This can cause an archive subscriber to stall. To fix this, get a proxy to the event subscriber, e.g. ``DeviceProxy("archiving/hdbppts/eventsubscriber01")``, and remove the offending attribute(s) from the ``ArchivingList`` property using ``proxy.get_property("ArchivingList")`` and ``proxy.put_property({"ArchivingList": [...]})``.
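The clean-up described in the note could be sketched as follows. The helper ``remove_from_archiving_list``, the ``FakeProxy`` stand-in, and the example entry format are hypothetical; the only calls assumed from the real proxy are ``get_property`` (returning a dict keyed by property name) and ``put_property`` (accepting such a dict).

```python
def remove_from_archiving_list(proxy, offending):
    """Drop ArchivingList entries that mention any of the offending attributes.

    Matching is a case-insensitive substring test. That is an assumption about
    how entries are formatted, and it also matches longer names that contain
    the given one, so inspect the returned list before relying on it.
    """
    current = list(proxy.get_property("ArchivingList")["ArchivingList"])
    needles = [name.lower() for name in offending]
    kept = [entry for entry in current
            if not any(needle in entry.lower() for needle in needles)]
    if kept != current:
        proxy.put_property({"ArchivingList": kept})
    return [entry for entry in current if entry not in kept]


class FakeProxy:
    """Hypothetical in-memory stand-in for a DeviceProxy."""

    def __init__(self, entries):
        self.properties = {"ArchivingList": list(entries)}

    def get_property(self, name):
        return {name: list(self.properties.get(name, []))}

    def put_property(self, values):
        self.properties.update(values)


# The entry format below is illustrative only.
proxy = FakeProxy([
    "tango://host:10000/stat/sdp/1/fpga_error_r",
    "tango://host:10000/stat/sdp/1/fpga_mask_rw",
])
removed = remove_from_archiving_list(proxy, ["STAT/SDP/1/FPGA_error_R"])
```

On a real system, ``proxy`` would be the event subscriber's ``DeviceProxy`` rather than the stand-in.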
Inspecting the database
`````````````````````````
The archived attributes end up in a `TimescaleDB <http://timescale.com>`_ database, exposed on port 5432, with credentials ``postgres/password``. Key tables are:
:att_conf: Describes which attributes are registered. Note that any device and attribute names are in lower case.
:att_scalar_devXXX: Contains the attribute history for scalar attributes of type XXX.
:att_array_devXXX: Contains the attribute history for 1D array attributes of type XXX.
:att_image_devXXX: Contains the attribute history for 2D array attributes of type XXX.
Each of the attribute history tables contains entries for any recorded value changes, but also for changes in ``quality`` (0=ok, >0=issues), and any error ``att_error_desc_id``. Furthermore, we provide specialised views which combine tables into more readable information:
:lofar_scalar_XXX: View on the attribute history for scalar attributes of type XXX.
:lofar_array_XXX: View on the attribute history for 1D array attributes of type XXX. Each array element is returned in its own row, with ``x`` denoting the index.
:lofar_image_XXX: View on the attribute history for 2D array attributes of type XXX. Each array element is returned in its own row, with ``x`` and ``y`` denoting the indices.
A typical selection could thus look like::

  SELECT
    data_time AS time, device, name, x, value
  FROM lofar_array_boolean
  WHERE device = 'stat/sdp/1' AND name = 'fpga_error_r'
  ORDER BY time DESC
  LIMIT 16

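When such selections are issued from a script, it helps to lower-case the names and pass them as parameters. A minimal sketch, assuming the time column is called ``data_time`` (as in the Grafana query in this document); the helper ``build_history_query`` is hypothetical:

```python
import re


def build_history_query(view, device, name, limit=16):
    """Return (sql, params) selecting the latest rows for one attribute."""
    # View names cannot be passed as query parameters, so validate them instead.
    if not re.fullmatch(r"lofar_(scalar|array|image)_[a-z0-9_]+", view):
        raise ValueError(f"unexpected view name: {view!r}")
    sql = (
        f"SELECT * FROM {view} "
        "WHERE device = %s AND name = %s "
        "ORDER BY data_time DESC "  # assumed column name
        "LIMIT %s"
    )
    # Device and attribute names are stored in lower case.
    return sql, (device.lower(), name.lower(), int(limit))


sql, params = build_history_query("lofar_array_boolean", "STAT/SDP/1", "FPGA_error_R")
```

The resulting pair uses ``%s`` placeholders, so it can be handed to a DB-API cursor such as psycopg2's ``cursor.execute(sql, params)``.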
Attributes in Grafana
````````````````````````
The Grafana instance (http://localhost:3000) is linked to TimescaleDB by default. The query for plotting an attribute requires some Grafana-specific macros to select the exact data points Grafana requires::

  SELECT
    $__timeGroup(data_time, $__interval),
    x::text, device, name,
    value
  FROM lofar_array_boolean
  WHERE
    $__timeFilter(data_time) AND name = 'fpga_error_r'
  ORDER BY 1,2

The fields ``x``, ``device``, and ``name`` are retrieved as *string*, as that makes them labels to the query, which Grafana then uses to identify the different metrics for each array element.
.. hint:: Grafana orders labels alphabetically. To order the ``x`` element properly, one could use the ``TO_CHAR(x, '00')`` function instead of ``x::text`` to prepend values with 0.
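The effect of alphabetical label ordering, and why zero-padding helps, can be illustrated outside Grafana. This sketch only mimics the string sorting in Python; it does not reproduce the exact output of ``TO_CHAR``:

```python
indices = list(range(12))

# Plain text labels sort alphabetically, so "10" and "11" land before "2".
plain = sorted(str(x) for x in indices)

# Zero-padded labels sort alphabetically in the same order as the numbers.
padded = sorted(f"{x:02d}" for x in indices)
```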
Setting up alerts
```````````````````
We use `Grafana 8+ alerts <https://grafana.com/docs/grafana/latest/alerting/>`_ to monitor our system, and the alerts are forwarded to our `Alerta <http://alerta.io>`_ instance. Both our default set of alerts and this forwarding have to be configured after installation:
- Go to Grafana (http://localhost:3000) and sign in with an administration account (default: admin/admin),
- Go to ``(cogwheel) -> API keys`` and create an ``editor`` API key. Copy the resulting hash,
...
.. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a `test alert <http://localhost:3000/alerting/notifications/receivers/Alerta/edit?alertmanager=grafana>`_.
The following enhancements are useful to configure for the alerts:
- You'll want to alert on a query, followed by a ``Reduce`` step with Function ``Last`` and Mode ``Drop Non-numeric Value``. This triggers the alert on the latest value(s), but keeps the individual array elements separated,
- In ``Add details``, the ``Dashboard UID`` and ``Panel ID`` annotations determine where the user is sent, as Grafana will generate hyperlinks from them. To obtain a dashboard uid, go to ``Dashboards -> Browse``, open the dashboard, and check its URL. For the panel id, view a panel and check the URL,
- In ``Add details``, the ``Summary`` annotation will be used as the alert description,
- In ``Custom labels``, add ``severity = major`` to raise the severity of the alert (default: warning). See also the `supported values <https://docs.alerta.io/webui/configuration.html#severity-colors>`_.
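The intent of the ``Reduce`` step with ``Last`` and the drop-non-numeric mode can be illustrated with a small sketch. This is not Grafana's implementation, only a model of the intended semantics: each labelled series is reduced separately to its last numeric value. The ``x=...`` labels, the sample data, and the ``> 0`` condition are made up for the example.

```python
import math


def reduce_last(series_by_label):
    """Reduce each labelled series to its last numeric value.

    Non-numeric samples (None, NaN, infinities) are dropped first. Series
    without any numeric sample are omitted here; what Grafana does with
    them is not modelled.
    """
    reduced = {}
    for label, values in series_by_label.items():
        numeric = [v for v in values if v is not None and math.isfinite(v)]
        if numeric:
            reduced[label] = numeric[-1]
    return reduced


series = {
    "x=0": [0.0, 0.0, 1.0],
    "x=1": [0.0, 1.0, None],
    "x=2": [float("nan")],
}
latest = reduce_last(series)

# A hypothetical alert condition: fire for every element whose last value is > 0.
firing = sorted(label for label, value in latest.items() if value > 0)
```

Because the reduction runs per label, each array element can fire its own alert instead of being merged into one.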
Alerta dashboard
``````````````````
The Alerta dashboard (http://localhost:8081) provides an overview of received alerts, which stay in the list until the alert condition disappears, and the alert is explicitly acknowledged or deleted:
- *Acknowledging* an alert silences it for a day,
- *Shelving* an alert silences it for 2 hours, and removes it from most overviews,
- *Watching* an alert means receiving browser notifications on changes,
- *Deleting* an alert removes it until Grafana sends it again (default: 10 minutes).
See ``docker-compose/alerta-web/alertad.conf`` for these settings.
Several installed plugins enhance the received events:
- ``slack`` plugin forwards alerts to Slack (see below),
- Our own ``grafana`` plugin parses Grafana-specific fields and adds them to the alert,
- Our own ``lofar`` plugin parses and generates LOFAR-specific fields.