diff --git a/tangostationcontrol/docs/source/alerting.rst b/tangostationcontrol/docs/source/alerting.rst
index 0eeaa50ff751669cfe52b332b3de5e798a339114..bc4d576830e215b1f79a99cfeef886c60f6dbddd 100644
--- a/tangostationcontrol/docs/source/alerting.rst
+++ b/tangostationcontrol/docs/source/alerting.rst
@@ -3,14 +3,92 @@ Alerting

 We use the following setup to forward alarms:

-- The Tango Controls `hdbpp subsystem <https://tango-controls.readthedocs.io/en/latest/administration/services/hdbpp/hdb++-design-guidelines.html>`_ archives data-value changes into a TimescaleDB database,
+- The Tango Controls `hdbpp subsystem <https://tango-controls.readthedocs.io/en/latest/administration/services/hdbpp/hdb++-design-guidelines.html>`_ archives data-value changes into a `TimescaleDB <http://timescale.com>`_ database,
 - Grafana allows `Alert rules <https://grafana.com/docs/grafana/latest/alerting/>`_ to be configured, which poll TimescaleDB and generate an *alert* when the configured condition is met. It also maintains a list of currently firing alerts,
 - `Alerta <https://alerta.io/>`_ is the *alert manager*: it receives these alerts, manages duplicates, and maintains alerts until the operator explicitly acknowledges them. It thus also has a list of alerts that fired in the past.

+Archiving attributes
+```````````````````````
+
+The attributes of interest have to be *archived* periodically in order to see them in Grafana, and thus to be able to define alerts for them. In Tango Controls, a *configuration manager* provides an interface to manage what is archived, and one or more *event subscribers* subscribe to attribute-change events and forward them to the archive database.
+
+The ``tangostationcontrol.toolkit.archiver.Archiver`` class provides an easy interface to the archiver. It uses the ``device/attribute`` notation for attributes, e.g. ``STAT/SDP/1/FPGA_error_R``. Some of the functions it provides:
+
+:add_attribute_to_archiver(attribute, polling_period, event_period): Register the given attribute, to be polled every ``polling_period`` ms. Changes are also archived, at a maximum rate of one event per ``event_period`` ms.
+
+:remove_attribute_from_archiver(attribute): Unregister the given attribute.
+
+:start_archiving_attribute(attribute): Start archiving the given attribute.
+
+:stop_archiving_attribute(attribute): Stop archiving the given attribute.
+
+:get_attribute_errors(attribute): Return any errors detected while trying to archive the attribute.
+
+:get_subscriber_errors(): Return any errors detected by the subscribers.
+
+So a useful idiom to archive an individual attribute is::
+
+  from tangostationcontrol.toolkit.archiver import Archiver
+
+  archiver = Archiver()
+  attribute = "STAT/SDP/1/FPGA_error_R"
+  archiver.add_attribute_to_archiver(attribute, 1000, 1000)
+  archiver.start_archiving_attribute(attribute)
+
+.. note:: The archive subscriber gets confused if attributes it archives disappear from the monitoring database. This can cause an archive subscriber to stall. To fix this, get a proxy to the event subscriber, e.g. ``DeviceProxy("archiving/hdbppts/eventsubscriber01")``, and remove the offending attribute(s) from the ``ArchivingList`` property using ``proxy.get_property("ArchivingList")`` and ``proxy.put_property({"ArchivingList": [...]})``.
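+
+A minimal sketch of this procedure, using the event subscriber named above and a hypothetical offending attribute (the exact format of the ``ArchivingList`` entries may differ, so the example matches on the ``device/attribute`` part)::
+
+  from tango import DeviceProxy
+
+  # Proxy to the stalled event subscriber (device name as used in the note above).
+  proxy = DeviceProxy("archiving/hdbppts/eventsubscriber01")
+
+  # Device/attribute that no longer exists in the monitoring database (example name).
+  offending = "stat/sdp/1/fpga_error_r"
+
+  # get_property() returns a dict with the property name as key and the
+  # current list of archived attributes as value.
+  archiving_list = list(proxy.get_property("ArchivingList")["ArchivingList"])
+
+  # Drop the offending entry and write the cleaned list back.
+  archiving_list = [attr for attr in archiving_list if offending not in attr.lower()]
+  proxy.put_property({"ArchivingList": archiving_list})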
+
+Inspecting the database
+`````````````````````````
+
+The archived attributes end up in a `TimescaleDB <http://timescale.com>`_ database, exposed on port 5432, with credentials ``postgres/password``. Key tables are:
+
+:att_conf: Describes which attributes are registered. Note that device and attribute names are stored in lower case.
+
+:att_scalar_devXXX: Contains the attribute history for scalar attributes of type XXX.
+
+:att_array_devXXX: Contains the attribute history for 1D array attributes of type XXX.
+
+:att_image_devXXX: Contains the attribute history for 2D array attributes of type XXX.
+
+Each of the attribute history tables contains entries for any recorded value changes, but also for changes in ``quality`` (0=ok, >0=issues) and for any errors (referenced through ``att_error_desc_id``). Furthermore, we provide specialised views which combine these tables into more readable information:
+
+:lofar_scalar_XXX: View on the attribute history for scalar attributes of type XXX.
+
+:lofar_array_XXX: View on the attribute history for 1D array attributes of type XXX. Each array element is returned in its own row, with ``x`` denoting the index.
+
+:lofar_image_XXX: View on the attribute history for 2D array attributes of type XXX. Each array element is returned in its own row, with ``x`` and ``y`` denoting the indices.
+
+A typical selection could thus look like::
+
+  SELECT
+    data_time AS time, device, name, x, value
+  FROM lofar_array_boolean
+  WHERE device = 'stat/sdp/1' AND name = 'fpga_error_r'
+  ORDER BY time DESC
+  LIMIT 16
+
+Attributes in Grafana
+````````````````````````
+
+The Grafana instance (http://localhost:3000) is linked to TimescaleDB by default. The query for plotting an attribute requires some Grafana-specific macros to select the exact data points Grafana requires::
+
+  SELECT
+    $__timeGroup(data_time, $__interval),
+    x::text, device, name,
+    value
+  FROM lofar_array_boolean
+  WHERE
+    $__timeFilter(data_time) AND name = 'fpga_error_r'
+  ORDER BY 1,2
+
+The fields ``x``, ``device``, and ``name`` are retrieved as *strings*, which makes them labels of the query. Grafana uses these labels to identify the different metrics for each array element.
+
+.. hint:: Grafana orders labels alphabetically. To order the ``x`` element properly, one could use ``TO_CHAR(x, '00')`` instead of ``x::text`` to pad the values with leading zeros.
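+
+If an archived attribute does not show up in Grafana at all, it may not be archived correctly. A quick check, as a sketch using the ``Archiver`` functions listed above (the attribute name is an example)::
+
+  from tangostationcontrol.toolkit.archiver import Archiver
+
+  archiver = Archiver()
+
+  # Errors reported for a specific attribute, e.g. a misspelled name or a polling failure.
+  print(archiver.get_attribute_errors("STAT/SDP/1/FPGA_error_R"))
+
+  # Errors reported by the event subscribers themselves.
+  print(archiver.get_subscriber_errors())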
+
 Setting up alerts
 ```````````````````

-To setup alerting, you first need to post-configure Grafana to populate it with alerting rules, and a policy to forward rules to Grafana:
+We use `Grafana 8+ alerts <https://grafana.com/docs/grafana/latest/alerting/>`_ to monitor our system, and these alerts are forwarded to our `Alerta <http://alerta.io>`_ instance. Both our default set of alerts and this forwarding have to be post-configured after installation:

 - Go to Grafana (http://localhost:3000) and sign in with an administration account (default: admin/admin),
 - Go to ``(cogwheel) -> API keys`` and create an ``editor`` API key. Copy the resulting hash,
@@ -20,6 +98,31 @@ To setup alerting, you first need to post-configure Grafana to populate it with
 .. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a `test alert <http://localhost:3000/alerting/notifications/receivers/Alerta/edit?alertmanager=grafana>`_.

+The following enhancements are useful to configure for the alerts:
+
+- You'll want to alert on a query, followed by a ``Reduce`` step with Function ``Last`` and Mode ``Drop Non-numeric Value``. This triggers the alert on the latest value(s), but keeps the individual array elements separated,
+- In ``Add details``, the ``Dashboard UID`` and ``Panel ID`` annotations are useful for pointing the user to the relevant dashboard and panel, as Grafana will generate hyperlinks from them. To obtain a dashboard UID, go to ``Dashboards -> Browse`` and check its URL. For the panel ID, view the panel and check its URL,
+- In ``Add details``, the ``Summary`` annotation will be used as the alert description,
+- In ``Custom labels``, add ``severity = major`` to raise the severity of the alert (default: warning). See also the `supported values <https://docs.alerta.io/webui/configuration.html#severity-colors>`_.
+
+Alerta dashboard
+``````````````````
+
+The Alerta dashboard (http://localhost:8081) provides an overview of received alerts, which stay in the list until the alert condition disappears and the alert is explicitly acknowledged or deleted:
+
+- *Acknowledging* an alert silences it for a day,
+- *Shelving* an alert silences it for 2 hours, and removes it from most overviews,
+- *Watching* an alert means receiving browser notifications on changes,
+- *Deleting* an alert removes it until Grafana sends it again (default: 10 minutes).
+
+See ``docker-compose/alerta-web/alertad.conf`` for these settings.
+
+Several installed plugins enhance the received events:
+
+- The ``slack`` plugin forwards alerts to Slack (see below),
+- Our own ``grafana`` plugin parses Grafana-specific fields and adds them to the alert,
+- Our own ``lofar`` plugin parses and generates LOFAR-specific fields.
+
 Slack integration
 ```````````````````
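+
+The forwarding to Slack is done by the ``slack`` plugin mentioned above, which is configured in ``docker-compose/alerta-web/alertad.conf``. Purely as an illustration (the setting names below come from Alerta's stock ``slack`` plugin, and the values are placeholders, not our actual configuration), such a configuration could look like::
+
+  # alertad.conf is a Python-syntax settings file read by the Alerta server.
+
+  # The 'slack' plugin must be enabled, next to the other plugins mentioned above.
+  PLUGINS = ['slack', 'grafana', 'lofar']
+
+  # Incoming-webhook URL created in the Slack workspace (placeholder value).
+  SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'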