User Manual
===================================

The Grafana system exposed on http://localhost:3001 allows visualisation of the monitoring information collected by Prometheus (and other sources). It contains, with links to the relevant Grafana documentation:

* A series of `dashboards <https://grafana.com/docs/grafana/latest/dashboards/>`_, organised into *folders*. Each dashboard is an independent page of visualisations. If you log in, you will see the configured "Home" dashboard.
* Each dashboard has a series of `panels <https://grafana.com/docs/grafana/latest/panels/>`_, often organised into collapsible *rows*. Each panel contains a specific visualisation, and can have alarms configured on it. The panels are tiled.
* Each panel has a set of *queries*, which describe the data to be visualised, and a single *visualization*, which is how the data is visualised.

The Grafana documentation will help you with using Grafana in general. Also useful are the following videos and posts:

* `Grafana Dashboard: Monitor CPU, Memory, Disk and Network Traffic Using Prometheus and Node Exporter <https://www.youtube.com/watch?v=YUabB_7H710>`_ [video, 26m], explains how to build dashboards in Grafana,
* `Guide to Grafana 101: Getting Started With (Awesome) Visualizations <https://www.youtube.com/watch?v=oPumWaoNw5s&t=1546s>`_ [video, 37m], explains setting up visualisations in Grafana (using a TimescaleDB data source),
* `Guide to Grafana 101 <https://www.youtube.com/playlist?list=PLsceB9ac9MHTjwvV18QJnPcLrTXm_Q-Ft>`_ [videos], series of videos explaining both beginner and advanced topics,
* `How to build a Prometheus query in Grafana <https://www.youtube.com/watch?v=5VrjOzIOJPw>`_ [video, 4m], explains how to use Prometheus in Grafana,
* `Using query builders for Grafana Loki and Prometheus <https://university.grafana.com/learn/course/83/play/711/video-using-query-builders-for-grafana-loki-and-prometheus;lp=22>`_ [e-learning, 15m], interactively helps you use the Grafana query builders to create Prometheus queries in Grafana,
* `Grafana Alerting: Explore our latest updates in Grafana 9 <https://grafana.com/blog/2022/06/14/grafana-alerting-explore-our-latest-updates-in-grafana-9/>`_ [blog, 6m], explains how Alerting works in Grafana,
* `FlowCharting Getting Started <https://algenty.github.io/flowcharting-repository/STARTED.html>`_ [blog], explains how to use the FlowCharting plugin to animate draw.io drawings in Grafana.

Finally, be sure to check out the `webinars and videos <https://grafana.com/videos/>`_ provided by the Grafana team.

Writing Queries
------------------------------------

Most of the data will be queried from the *Prometheus* backend:

* Grafana provides a `Prometheus query editor <https://grafana.com/docs/grafana/latest/datasources/prometheus/#prometheus-query-editor>`_ to interactively set up queries,
* The queries themselves use the `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`_ syntax.
* Apart from configuring panels, you can also play with queries in the Explore tab (http://localhost:3001/explore), and directly in the Prometheus backend (http://localhost:9091).
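
As an illustration of the last point, the Prometheus backend can also be queried from a script. The JavaScript sketch below only builds the request URL. It assumes that the backend on port 9091 serves the standard Prometheus HTTP API under ``/api/v1/query``; the helper name ``instantQueryUrl`` and the example query are made up for this illustration::

  // Build the URL of an instant query against the Prometheus HTTP API.
  // Assumption: the backend exposes the standard /api/v1/query endpoint.
  function instantQueryUrl(baseUrl, promql) {
    const url = new URL("/api/v1/query", baseUrl);
    // searchParams escapes the braces, quotes and slashes in the query
    url.searchParams.set("query", promql);
    return url.toString();
  }

  const url = instantQueryUrl("http://localhost:9091", 'up{job="prometheus"}');
  console.log(url);

Requesting this URL (for example with ``fetch(url)``) should return the query result as a JSON document, provided the backend is reachable from where the script runs.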

The Prometheus database is flat, containing time-series for metrics which carry a name, labels, and a float value::

  attribute_name{label="value", ...} attribute_value

For example::

  device_attribute{host="dop496", station="DTS Outside", device="stat/sdp/1", name="FPGA_temp_R", x="03", y="00"} 42.3

The queries express selections on these entries for a given name, filtered by the given labels. For example, the following query returns all FPGA temperatures across all stations, including the above entry::

  device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}
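
To make the selection rule concrete, the following JavaScript sketch mimics what such a selector does on a small, hand-written set of series. The ``select`` helper and the sample data (other than the entry shown above) are hypothetical; Prometheus performs this matching itself::

  // Hand-written sample series; only the first entry comes from the text above.
  const series = [
    { metric: "device_attribute", value: 42.3,
      labels: { host: "dop496", device: "stat/sdp/1", name: "FPGA_temp_R", x: "03", y: "00" } },
    { metric: "device_attribute", value: 40.1,
      labels: { host: "dts-lcu", device: "stat/sdp/1", name: "FPGA_temp_R", x: "00", y: "00" } },
    { metric: "device_attribute", value: 1,
      labels: { host: "dop496", device: "stat/sdp/1", name: "TR_fpga_mask_R", x: "03", y: "00" } },
  ];

  // Keep the series with the given metric name whose labels equal all matchers.
  // Labels that are not mentioned in the matchers (host, x, y) are not filtered on.
  function select(allSeries, metric, matchers) {
    return allSeries.filter(s =>
      s.metric === metric &&
      Object.entries(matchers).every(([label, value]) => s.labels[label] === value)
    );
  }

  const temps = select(series, "device_attribute",
                       { device: "stat/sdp/1", name: "FPGA_temp_R" });
  console.log(temps.map(s => `${s.labels.host}: ${s.value}`));

Adding a further matcher, such as ``host: "dop496"``, narrows the result to a single station, just like adding ``host="dop496"`` to the query would.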

Furthermore, values of different metrics can be combined (added, merged, etc.). See the PromQL documentation for more details, or read:

* `An Intro to PromQL <https://logz.io/blog/promql-examples-introduction/>`_,
* `PromQL Tutorial for Beginners <https://valyala.medium.com/promql-tutorial-for-beginners-9ab455142085>`_,
* `PromQL Cheat Sheet <https://promlabs.com/promql-cheat-sheet/>`_.

Querying LOFAR Station Control
````````````````````````````````````

The `LOFAR Station Control <https://lofar20-station-control.readthedocs.io/en/latest/>`_ software exposes a series of metrics from each station:

:device_attribute: All monitoring points from Tango that are configured to be exposed to Prometheus. For arrays, each element is its own metric. It carries the following labels:

  :job:       `stations`
  :host:      Station hostname from which the value was obtained (e.g. `dts-lcu`),
  :station:   Name of the station, as reported by the station (e.g. `DTS`) (NB: for now, the host is more reliable to use),
  :device:    Tango device of this attribute (e.g. `stat/recv/1`),
  :name:      Tango attribute name (e.g. `ANT_mask_RW`),
  :type:      Data type (e.g. `string`, `float`, `bool`),
  :x:         Offset in the first dimension, if the attribute is a 1D or 2D array, or "00",
  :y:         Offset in the second dimension, if the attribute is a 2D array, or "00",
  :idx:       Global offset in the array, combining `x` and `y`,
  :str_value: The value of the attribute, if the attribute type is a string.

:device_scraping: Time required to scrape each Tango device, in seconds. It carries the following labels:

  :job:       `stations`
  :host:      Station hostname from which the value was obtained (e.g. `dts-lcu`),
  :station:   Name of the station, as reported by the station (e.g. `DTS`) (NB: for now, the host is more reliable to use),
  :device:    Tango device scraped.

Metrics from the non-Tango services are exposed as well. See the linked documentation, or use the interactive interfaces, to explore them further:

:scrape\_\*: Metrics describing scraping (=Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.

  :job:          `stations`
  :host:         Station hostname from which the value was obtained (e.g. `dts-lcu`),
  :exported_job: Original job on the station (`host`, `prometheus`, `grafana`).

:node\_\*: Metrics describing the server, see https://github.com/prometheus/node_exporter.

  :job:          `stations`
  :host:         Station hostname from which the value was obtained (e.g. `dts-lcu`),
  :exported_job: `host`

:go\_\*, grafana\_\*: Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.

  :job:          `stations`
  :host:         Station hostname from which the value was obtained (e.g. `dts-lcu`),
  :exported_job: `grafana`

Querying Operational Central Management
````````````````````````````````````````

This software stack itself also exposes metrics from its various services:


:scrape\_\*: Metrics describing scraping (=Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.

  :job:          `prometheus`

:node\_\*: Metrics describing the server, see https://github.com/prometheus/node_exporter.

  :job:          `host`

:go\_\*, grafana\_\*: Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.

  :job:          `grafana`

Query Tricks
````````````````````````````````````````

Standard deviation between non-zero elements within an array (but not over time)::

  stddev(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"} > 0)

Standard deviation for each element individually over time (but not between elements). Because the window is ``$__range_interval``, the last data point in the panel covers all the data shown in the panel. It is therefore recommended to use an ``Instant`` query, or otherwise to display only the last value::

  stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval])

Standard deviation for each element over time, for elements that are non-zero over the full range. The ``avg_over_time(...) > bool 0`` returns 1 for elements we want, and 0 for those we don't. We use the construct ``((mask * (result + 1)) > 0) - 1`` to a) filter out any result for which the mask is 0, and b) retain values for which the result is 0. The outer parentheses are needed because PromQL evaluates ``+`` and ``-`` before comparison operators::

  (
    (avg_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) > bool 0)
    * on(x,y)
    (stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) + 1)
    > 0
  )
  - 1
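
The effect of the masking construct above can be checked per element with ordinary arithmetic. The sketch below is a JavaScript emulation, not PromQL; ``maskedStddev`` is a hypothetical helper, and ``undefined`` stands for a sample that the ``> 0`` filter drops::

  // Emulates, for one array element, the steps of the query above.
  // Assumption: stddev is never negative, so (stddev + 1) is always > 0.
  function maskedStddev(avg, stddev) {
    const mask = avg > 0 ? 1 : 0;          // avg_over_time(...) > bool 0
    const shifted = mask * (stddev + 1);   // mask * (result + 1)
    if (!(shifted > 0)) return undefined;  // "> 0" drops masked-out elements
    return shifted - 1;                    // "- 1" restores the original value
  }

  console.log(maskedStddev(0, 0));     // masked out: dropped
  console.log(maskedStddev(20, 0));    // kept, although its result is 0
  console.log(maskedStddev(20, 1.5));  // kept, value unchanged

Without the ``+ 1`` and ``- 1`` steps, the second case could not be distinguished from the first: both would yield 0 and be filtered out.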

Masked values. If we have an attribute which is covered by some mask, we can use the following to force values outside of the mask to 0. The following formula returns 1 only for FPGAs which have a communication error, and are enabled in ``TR_fpga_mask_R``::

  device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
  * on(x,y) (device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"} > bool 0)

Masked boolean values, but we also want to process values for disabled elements. If we have an attribute which has a boolean value *and* is covered by some mask, we can distinguish the four combinations using the following formula, which results in: 0 = False, disabled in mask; 1 = True, disabled in mask; 2 = False, enabled in mask; 3 = True, enabled in mask::

  device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
  + on(x,y) (2 * device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"})


Map Panels
------------------------------------

Grafana provides panels to display points on a map, where both the locations of the pins and their colours are drawn from the database. We use the `Orchestra Cities Map <https://grafana.com/grafana/plugins/orchestracities-map-panel/>`_ plugin for the best results.

The position information is best used in the `Geohash <https://www.pubnub.com/learn/glossary/what-is-geohashing/>`_ format, which encodes latitude and longitude as a single string. The station exposes the following geohash positions as attributes::

  # Position of each HBAT
  device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"}

  # Position of each Antenna Field
  device_attribute{host="$station", device="stat/antennafield/1", name="Antenna_Field_Reference_GEOHASH_R"}
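
When debugging, it can help to convert a geohash ``str_value`` back to coordinates by hand. The map panel does this conversion itself, so the JavaScript sketch below is for illustration only. It implements generic geohash decoding (base-32 characters whose bits alternate between longitude and latitude); ``decodeGeohash`` is a hypothetical helper, and ``ezs42`` is a commonly used textbook example, not a station position::

  const BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Decode a geohash to the centre of the cell it describes.
  function decodeGeohash(hash) {
    const lat = [-90, 90];
    const lon = [-180, 180];
    let isLon = true;  // bits alternate, starting with longitude
    for (const ch of hash.toLowerCase()) {
      const idx = BASE32.indexOf(ch);
      if (idx < 0) throw new Error(`invalid geohash character: ${ch}`);
      // each character carries 5 bits, most significant first
      for (let bit = 4; bit >= 0; bit--) {
        const range = isLon ? lon : lat;
        const mid = (range[0] + range[1]) / 2;
        // a 1 bit selects the upper half of the range, a 0 bit the lower half
        if ((idx >> bit) & 1) range[0] = mid; else range[1] = mid;
        isLon = !isLon;
      }
    }
    return { lat: (lat[0] + lat[1]) / 2, lon: (lon[0] + lon[1]) / 2 };
  }

  // roughly latitude 42.605, longitude -5.603
  console.log(decodeGeohash("ezs42"));

Each additional character narrows the cell further, so longer geohashes describe more precise positions.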

To use these in the Orchestra map, you need to configure the *Data Layer* as follows:

* Set `Location` to `Geohash`,
* Set `Geohash field` to `str_value`.

Furthermore, consider setting:

* `Base layer` to `Open Street Map`,
* `Map view` -> `Initial view` -> `View` to `Auto Center`.

To add *colours* for each dot, we need to combine the position with the value of another metric, for example::

  sum by (host, x) (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_PWR_on_R"})
  + on(host, x) group_right() (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"} * 0)

In which `HBAT_PWR_on_R` represents the power of an HBAT element, which we sum per tile over all elements using `sum by (host, x)`. The output of this, the first line, will determine the colour. The position is added by using `group_right`, which adds the critical `str_value` label of `HBAT_reference_GEOHASH_R`. By using `+ on(...) (... * 0)`, we make sure the metric value is not influenced by the value of the geohash metric.
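
The join can be pictured as follows. This JavaScript sketch emulates the two steps on made-up data (the host, power values and geohash strings are invented, and ``sumBy`` and ``joinGroupRight`` are hypothetical helpers): first the per-tile sum, then the join that copies the labels of the geohash series onto the result::

  // Made-up samples: two elements per tile for the power, one geohash per tile.
  const power = [
    { labels: { host: "dts-lcu", x: "00", y: "00" }, value: 1 },
    { labels: { host: "dts-lcu", x: "00", y: "01" }, value: 1 },
    { labels: { host: "dts-lcu", x: "01", y: "00" }, value: 0 },
    { labels: { host: "dts-lcu", x: "01", y: "01" }, value: 1 },
  ];
  const geohash = [
    // value 1 is arbitrary: it stands for whatever number the string metric carries
    { labels: { host: "dts-lcu", x: "00", y: "00", str_value: "u1kf2b" }, value: 1 },
    { labels: { host: "dts-lcu", x: "01", y: "00", str_value: "u1kf2c" }, value: 1 },
  ];

  const key = (labels, on) => on.map(l => labels[l]).join("|");

  // sum by (on...): add up all samples that share the "on" labels.
  function sumBy(samples, on) {
    const sums = new Map();
    for (const s of samples) {
      const k = key(s.labels, on);
      sums.set(k, (sums.get(k) || 0) + s.value);
    }
    return sums;
  }

  // left + on(...) group_right() (right * 0): the result keeps the labels
  // of the right-hand side and the value of the left-hand side.
  function joinGroupRight(leftSums, right, on) {
    return right
      .filter(r => leftSums.has(key(r.labels, on)))
      .map(r => ({
        labels: r.labels,
        value: leftSums.get(key(r.labels, on)) + r.value * 0,
      }));
  }

  const on = ["host", "x"];
  const coloured = joinGroupRight(sumBy(power, on), geohash, on);
  console.log(coloured);

Each resulting sample carries the summed power as its value and the ``str_value`` label with the position, which is what the map panel needs.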

The colour is configured in the *Data Layer* by setting `Marker Color` to `Value`, and configuring the `Thresholds` at the bottom for the actual colour to be used for each range of values.

DrawIO Panels
------------------------------------

The `FlowCharting <https://grafana.com/grafana/plugins/agenty-flowcharting-panel/>`_ plugin allows you to create a drawing in `Draw.io <https://draw.io>`_, and replace values or colour elements according to query results.

See also the `FlowCharting documentation <https://algenty.github.io/flowcharting-repository/STARTED.html>`_.

Advanced: SVG Panels
------------------------------------

It is possible to display an SVG picture, and have areas of it highlight based on query results. We find the `ACE.SVG <https://grafana.com/grafana/plugins/aceiot-svg-panel/>`_ plugin to be useful.

An example setup can be constructed as follows:

* Add a query with the *Alerta UI* data source, with:

  * Rename query to ``AlertaAlerts``,
  * ``URL`` set to ``http://alerta-server:8080/api/alerts``,
  * ``Rows/Root`` set to ``alerts``,
  * Add ``Query Param``: key ``status`` value ``open`` (to filter out any closed alerts).

* Add a query with the *Grafana API* data source, with:

  * Rename query to ``GrafanaAlerts``,
  * ``URL`` set to ``http://localhost:3000/api/alertmanager/grafana/api/v2/alerts``,
  * ``Rows/Root`` set to ``alerts``.

* Add a query with the *Prometheus* data source, with:

  * Rename query to ``PrometheusData``,
  * Query set to e.g. ``scrape_samples_scraped``.

* Load an SVG, with at least 2 named regions,
* Under ``SVG Mappings``, map these two regions to the names ``RegionA`` and ``RegionB``,
* Put the content of `lofar-svglib.js <https://git.astron.nl/lofar2.0/operations-central-management/-/tree/main/grafana-central/lofar-svglib.js>`_ in ``User JS Init Code``.
* Put the following code in ``User JS Render Code``::

    // find the right data series
    let series = data.series.find(
      x => x.refId == "PrometheusData"
        && x.fields[1].labels.device == "total"
    )

    // use the last value
    let buffer = series.fields[1].values.buffer
    let lastValue = buffer[buffer.length-1]

    // colour RegionA accordingly
    svgmap.RegionA.css('fill', lastValue > 1 ? '#f00' : '#0f0')

    // link it to a fixed URL
    svgmap.RegionA.linkTo(function(link) {
      link.to('http://www.google.com').target('_blank')
    })

    // lookup an alert
    let alert = get_alert(data, "test")

    // colour RegionB accordingly
    svgmap.RegionB.css('fill', alert.colour)

    // link it to the alert URL
    if (alert.href) {
      svgmap.RegionB.linkTo(function(link) {
        link.to(alert.href).target('_blank')
      })
    }

    console.log("refreshed")

Alerting
-------------------------------------------

Alerts on data are generated by Grafana, by periodically polling their underlying queries and triggering an alert once an alarm condition is met. These alerts are forwarded to *Alerta*, in which the user can manage (and annotate) them.

In Grafana, alerts can be set up in two ways:

* Attached to a panel: through the ``Alert`` tab when editing a panel,
* Free floating: through the left-hand bar under ``Alerting -> Alerting Rules``.

Each rule consists of:

* A *Query* that selects the results on which to trigger,
* *Expression(s)* that say when the alert should fire.

Grafana is capable of generating multiple *instances* of the same alert, each covering one of the query results. This allows us to track different sources of the same alert individually, yet grouped. To do so, we need to retain the *labels* of each result as returned by the query, as each unique set of labels results in a different instance of the alert.

As the "Classical condition" Expression advised by Grafana drops all labels, we need to do something different:

* A Query ``A`` to fetch the data,
* A Reduce Expression ``B`` to select the ``Last`` (=current) values from query A,
* A Math Expression ``C`` that contains the threshold, e.g. ``$B > 0.5``.
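
Why this combination keeps one alert instance per label set can be seen in the following JavaScript emulation. It is only a sketch of the behaviour described above, not Grafana code; the data and the helper names ``reduceLast`` and ``mathAboveThreshold`` are invented::

  // Query A (made-up result): one series per label set.
  const A = [
    { labels: { host: "dts-lcu", x: "00" }, values: [0.2, 0.4, 0.9] },
    { labels: { host: "dts-lcu", x: "01" }, values: [0.8, 0.3, 0.1] },
  ];

  // Reduce expression B: keep the last value of each series, and its labels.
  const reduceLast = series =>
    series.map(({ labels, values }) => ({ labels, value: values[values.length - 1] }));

  // Math expression C: "$B > 0.5", evaluated per label set.
  const mathAboveThreshold = (reduced, threshold) =>
    reduced.map(({ labels, value }) => ({ labels, firing: value > threshold }));

  const C = mathAboveThreshold(reduceLast(A), 0.5);
  console.log(C);

Only the first label set exceeds the threshold, so only that instance of the alert would fire, and it can still be traced back to its ``host`` and ``x`` labels.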

Grafana subsequently re-evaluates the alert at a configurable interval (10s minimum). If the condition then holds for another configurable period (10s minimum), the alert fires. The following *additional details* can be configured to be sent along with the alert:

* ``Dashboard UID``: which dashboard to link to in the alert,
* ``Panel ID``: which panel to link to in the alert,
* ``Severity``: severity of the alert: ``critical``, ``major``, ``minor``, ``warning`` (default).

Alerta
```````````````````````````````````````````

The Alerta stack manages alerts that come from Grafana, and can be accessed through http://localhost:8082 (creds: admin/alerta). The main screen shows you an overview of the *unacknowledged alerts* that were generated by Grafana. Alerta allows an operator to track them using the `ISA 18.2 <http://www.tc.faa.gov/its/worldpac/Standards/isa/ISA_18.2[1].pdf>`_ alarm model, which has the following states per alert:

* ``NORM``: Condition is normal: alarm is not active, all past alarms were acknowledged,
* ``UNACK``: Alarm is active, and has not been acknowledged ("came"),
* ``RTNUN``: Alarm came and went, but has not been acknowledged ("went"),
* ``SHLVD``: Shelved: condition changes are ignored.

Alerts arrive in the ``UNACK`` state, and will alternate between ``UNACK`` and ``RTNUN`` until the user acknowledges the alert. Once acknowledged, the alert will not appear until it is triggered once again.
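
The transitions described above can be summarised as a small state machine. The JavaScript sketch below models only what this section describes, under loud assumptions: it treats acknowledgement as a return to ``NORM``, allows shelving from any state, and does not model leaving ``SHLVD``. Alerta's actual alarm model may contain more states and transitions; the event names (``raise``, ``clear``, ``ack``, ``shelve``) and the function ``nextState`` are hypothetical::

  // Transition table: TRANSITIONS[state][event] gives the next state.
  // Missing entries mean "stay in the current state".
  // Assumption: acknowledging returns the alert to NORM (a simplification).
  const TRANSITIONS = {
    NORM:  { raise: "UNACK", shelve: "SHLVD" },
    UNACK: { clear: "RTNUN", ack: "NORM", shelve: "SHLVD" },
    RTNUN: { raise: "UNACK", ack: "NORM", shelve: "SHLVD" },
    SHLVD: {},  // condition changes are ignored while shelved
  };

  function nextState(state, event) {
    if (!(state in TRANSITIONS)) throw new Error(`unknown state: ${state}`);
    return TRANSITIONS[state][event] || state;
  }

  // An alarm that comes, goes, comes back, and is then acknowledged.
  let state = "NORM";
  const visited = [state];
  for (const event of ["raise", "clear", "raise", "ack"]) {
    state = nextState(state, event);
    visited.push(state);
  }
  console.log(visited.join(" -> "));

In this sketch the alert alternates between ``UNACK`` and ``RTNUN`` until the ``ack`` event, matching the description above.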

.. note:: Alerta will generate a message on *Slack* any time an alert is freshly generated (goes from ``NORM`` to ``UNACK``).