    User Manual

    The Grafana system exposed on http://localhost:3001 allows visualisation of the monitoring information collected by Prometheus (and other sources). It contains the following concepts (see the relevant Grafana documentation for each):

    • A series of dashboards, organised into folders. Each dashboard is an independent page of visualisations. If you log in, you will see the configured "Home" dashboard.
    • Each dashboard has a series of panels, often organised into collapsible rows. Each panel contains a specific visualisation, and can have alarms configured on it. The panels are tiled.
    • Each panel has a set of queries, which describe the data to be visualised, and a single visualisation, which is how the data is displayed.

    The Grafana documentation will help you with using Grafana in general. Also be sure to check out the webinars and videos provided by the Grafana team.

    Writing Queries

    Most of the data will be queried from the Prometheus backend.

    The Prometheus database is flat, containing time-series for metrics which carry a name, labels, and a float value:

    attribute_name{label="value", ...} attribute_value

    For example:

    device_attribute{host="dop496", station="DTS Outside", device="stat/sdp/1", name="FPGA_temp_R", x="03", y="00"} 42.3

    The queries express selections on these entries for a given name, filtered by the given labels. For example, the following query returns all FPGA temperatures across all stations, including the above entry:

    device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}

    Furthermore, values of different metrics can be combined (added, merged, etc.). See the PromQL documentation for more details.
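
    For example, a hypothetical query that divides each FPGA temperature by its station-wide average, matching the two expressions on the station label, could look like this:

    device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}
    / on(station) group_left()
    avg by (station) (device_attribute{device="stat/sdp/1", name="FPGA_temp_R"})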

    Querying LOFAR Station Control

    The LOFAR Station Control software exposes a series of metrics from each station:

    device_attribute:

    All monitoring points from Tango that are configured to be exposed to Prometheus. For arrays, each element is its own metric. It carries the following labels:

    job: stations
    host: Station hostname from which the value was obtained (e.g. dts-lcu),
    station: Name of the station, as reported by the station (e.g. DTS) (NB: for now, the host is more reliable to use),
    device: Tango device of this attribute (e.g. stat/recv/1),
    name: Tango attribute name (e.g. ANT_mask_RW),
    type: Data type (e.g. string, float, bool),
    x: Offset in the first dimension, if the attribute is a 1D or 2D array, or "00",
    y: Offset in the second dimension, if the attribute is a 2D array, or "00",
    idx: Global offset in the array, combining x and y,
    str_value: The value of the attribute, if the attribute type is a string.
    device_scraping:

    Time required to scrape each Tango device, in seconds. It carries the following labels:

    job: stations
    host: Station hostname from which the value was obtained (e.g. dts-lcu),
    station: Name of the station, as reported by the station (e.g. DTS) (NB: for now, the host is more reliable to use),
    device: Tango device scraped.
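
    For example, the offset labels allow selecting a single element of an array attribute, reusing the FPGA temperature example from above:

    device_attribute{device="stat/sdp/1", name="FPGA_temp_R", x="03", y="00"}

    Similarly, the total time spent scraping all Tango devices on each station can be obtained by summing device_scraping over its devices:

    sum by (host) (device_scraping)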

    Metrics from the non-Tango services are exposed as well. See the linked documentation, or use the interactive interfaces, to explore them further:

    scrape_*:

    Metrics describing scraping (=Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.

    job: stations
    host: Station hostname from which the value was obtained (e.g. dts-lcu),
    exported_job: Original job on the station (host, prometheus, grafana).
    node_*:

    Metrics describing the server, see https://github.com/prometheus/node_exporter.

    job: stations
    host: Station hostname from which the value was obtained (e.g. dts-lcu),
    exported_job: host
    go_*, grafana_*:

    Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.

    job: stations
    host: Station hostname from which the value was obtained (e.g. dts-lcu),
    exported_job: grafana
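
    As a sketch (node_filesystem_avail_bytes is a standard node_exporter metric; the mountpoint value is illustrative), the free space on each station's root filesystem could be queried as:

    node_filesystem_avail_bytes{job="stations", mountpoint="/"}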

    Querying Operational Central Management

    This software stack itself also exposes metrics from its various services:

    scrape_*:

    Metrics describing scraping (=Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.

    job: prometheus
    node_*:

    Metrics describing the server, see https://github.com/prometheus/node_exporter.

    job: host
    go_*, grafana_*:

    Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.

    job: grafana
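
    For example, the scraping behaviour of the central stack itself can be inspected with a query along the following lines (scrape_samples_scraped is one of Prometheus' automatic scrape metrics):

    scrape_samples_scraped{job="prometheus"}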

    Query Tricks

    Standard deviation between non-zero elements within an array (but not over time):

    stddev(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"} > 0)

    Standard deviation for each element individually over time (but not between elements). Using $__range_interval, we ensure that the last data point in the panel covers all the data in the panel, so using an Instant query, or otherwise only the last value, is recommended:

    stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval])

    Standard deviation for each element over time, for elements that are non-zero over the full range. The avg_over_time(...) > bool 0 returns 1 for elements we want, and 0 for those we don't. We use the construct (mask * (result + 1) > 0) - 1 to a) filter out any result for which the mask is 0, and b) retain values for which the result is 0:

    (
      (avg_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) > bool 0)
      * on(x,y)
      (stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) + 1)
      > 0
    ) - 1

    Masked values. If we have an attribute which is covered by some mask, we can use the following to force values outside of the mask to 0. The following formula returns 1 only for FPGAs which have a communication error, and are enabled in TR_fpga_mask_R:

    device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
    * on(x,y) (device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"} > bool 0)

    Masked boolean values, but we also want to process values for disabled elements. If we have an attribute which has a boolean value and is covered by some mask, we can distinguish the 4 combinations using the following formula, which results in:

    • 0 = False, disabled in mask,
    • 1 = True, disabled in mask,
    • 2 = False, enabled in mask,
    • 3 = True, enabled in mask.

    device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
    + on(x,y) (2 * device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"})
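
    To show only the problematic combination (True, and enabled in the mask, i.e. value 3), the combined result above could be filtered further, for example:

    (device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
    + on(x,y) (2 * device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"})) == 3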

    Map Panels

    Grafana provides panels to display points on a map, where both the locations of the pins and their colours are drawn from the database. We use the Orchestra Cities Map plugin for the best results.

    The position information is best used in the Geohash format, which encodes latitude and longitude as a single string. The station exposes the following geohash positions as attributes:

    # Position of each HBAT
    device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"}
    
    # Position of each Antenna Field
    device_attribute{host="$station", device="stat/antennafield/1", name="Antenna_Field_Reference_GEOHASH_R"}

    To use these in the Orchestra map, you need to configure the Data Layer as follows:

    • Set Location to Geohash,
    • Set Geohash field to str_value.

    Furthermore, you will want to consider setting:

    • Base layer to Open Street Map,
    • Map view -> Initial view -> View to Auto Center.

    To add colours for each dot, we need to combine the position with the value of another metric, for example:

    sum by (host, x) (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_PWR_on_R"})
    + on(host, x) group_right() (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"} * 0)

    Here, HBAT_PWR_on_R represents the power of an HBAT element, which we sum per tile over all elements using sum by (host, x). The output of this, the first line, will determine the colour. The position is added by using group_right(), which adds the critical str_value of HBAT_reference_GEOHASH_R. By using + on(...) (... * 0), we make sure the metric value is not influenced by the value of the geohash metric.

    The colour is configured in the Data Layer by setting Marker Color to Value, and configuring the Thresholds at the bottom for the actual colour to be used for each range of values.

    DrawIO Panels

    The FlowCharting plugin allows you to create a drawing in Draw.io, and replace values or colour elements according to query results.

    See also the FlowCharting documentation.

    Advanced: SVG Panels

    It is possible to display an SVG picture, and have areas of it highlight based on query results. We find the ACE.SVG plugin to be useful.

    An example setup can be constructed as follows:

    • Add a query with the Alerta UI data source, with:

      • Rename query to AlertaAlerts,
      • URL set to http://alerta-server:8080/api/alerts,
      • Rows/Root set to alerts,
      • Add Query Param: key status value open (to filter out any closed alerts).
    • Add a query with the Grafana API data source, with:

      • Rename query to GrafanaAlerts,
      • URL set to http://localhost:3000/api/alertmanager/grafana/api/v2/alerts,
      • Rows/Root set to alerts.
    • Add a query with the Prometheus data source, with:

      • Rename query to PrometheusData,
      • Query set to e.g. scrape_samples_scraped.
    • Load an SVG, with at least 2 named regions,

    • Under SVG Mappings, map these two regions to the names RegionA and RegionB,

    • Put the content of lofar-svglib.js in User JS Init Code.

    • Put the following code in User JS Render Code:

      // find the right data series
      let series = data.series.find(
        x => x.refId == "PrometheusData"
          && x.fields[1].labels.device == "total"
      )
      
      // use the last value
      let buffer = series.fields[1].values.buffer
      let lastValue = buffer[buffer.length-1]
      
      // colour RegionA accordingly
      svgmap.RegionA.css('fill', lastValue > 1 ? '#f00' : '#0f0')
      
      // link it to a fixed URL
      svgmap.RegionA.linkTo(function(link) {
        link.to('http://www.google.com').target('_blank')
      })
      
      // lookup an alert (get_alert is provided by lofar-svglib.js)
      let alert = get_alert(data, "test")
      
      // colour RegionB accordingly
      svgmap.RegionB.css('fill', alert.colour)
      
      // link it to the alert URL
      if (alert.href) {
        svgmap.RegionB.linkTo(function(link) {
          link.to(alert.href).target('_blank')
        })
      }
      
      console.log("refreshed")

    Alerting

    Alerts on data are generated by Grafana, by periodically polling their underlying queries and triggering an alert once an alarm condition is met. These alerts are forwarded to Alerta, in which the user can manage (and annotate) them.

    In Grafana, alerts can be set up in two ways:

    • Attached to a panel: through the Alert tab when editing a panel,
    • Free floating: through the left-hand bar under Alerting -> Alerting Rules.

    Each rule consists of:

    • A Query that selects the results on which to trigger,
    • Expression(s) that say when the alert should fire.

    Grafana is capable of generating multiple instances of the same alert, each of which covers one of the query results. This allows us to track different sources of the same alert individually, yet grouped. To do so, we need to retain the labels of each result as returned by the query, as each unique set of labels results in a different instance of the alert.

    As the "Classic condition" Expression advised by Grafana drops all labels, we need to do something different:

    • A Query A to fetch the data,
    • A Reduce Expression B to select the Last (=current) values from query A,
    • A Math Expression C that contains the threshold, e.g. $B > 0.5.
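
    As a sketch, Query A could be a plain selection that returns one result (and thus one alert instance) per FPGA, with the threshold applied in the Math expression rather than in the query itself:

    device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}

    The Math Expression C would then be e.g. $B > 85 (the threshold value here is purely illustrative).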

    Grafana subsequently re-evaluates the alert at the configured interval (10s minimum), and if the alarm condition holds for another interval (10s minimum), the alert will fire. The following additional details can be configured to be sent along with the alert:

    • Dashboard UID: which dashboard to link to in the alert,
    • Panel ID: which panel to link to in the alert,
    • Severity: severity of the alert: critical, major, minor, warning (default).

    Alerta

    The Alerta stack manages alerts that come from Grafana, and can be accessed through http://localhost:8082 (creds: admin/alerta). The main screen shows you an overview of the unacknowledged alerts that were generated by Grafana. Alerta allows an operator to track them using the ISA 18.2 alarm model, which has the following states per alert:

    • NORM: Condition is normal: alarm is not active, all past alarms were acknowledged,
    • UNACK: Alarm is active, and has not been acknowledged ("came"),
    • RTNUN: Alarm came and went, but has not been acknowledged ("went"),
    • SHLVD: Shelved: condition changes are ignored.

    Alerts arrive in the UNACK state, and will alternate between UNACK and RTNUN until the user acknowledges the alert. Once acknowledged, the alert will not appear until it is triggered once again.

    Note

    Alerta will generate a message on Slack any time an alert is freshly generated (goes from NORM to UNACK).