user_manual.rst
Jan David Mol authored
User Manual
The Grafana system exposed on http://localhost:3001 allows visualisation of the monitoring information collected by Prometheus (and other sources). It contains, with links to the relevant Grafana documentation:
- A series of dashboards, organised into folders. Each dashboard is an independent page of visualisations. If you log in, you will see the configured "Home" dashboard.
- Each dashboard has a series of panels, often organised into collapsible rows. Each panel contains a specific visualisation and can have alarms configured on it. The panels are tiled.
- Each panel has a set of queries, which describe the data to be visualised, and a single visualisation, which is how the data is shown.
The Grafana documentation will help you with using Grafana in general. Also useful are the following videos and posts:
- Grafana Dashboard: Monitor CPU, Memory, Disk and Network Traffic Using Prometheus and Node Exporter [video, 26m], explains how to build dashboards in Grafana,
- Guide to Grafana 101: Getting Started With (Awesome) Visualizations [video, 37m], explains setting up visualisations in Grafana (using a TimescaleDB data source),
- Guide to Grafana 101 [videos], series of videos explaining both beginner and advanced topics,
- How to build a Prometheus query in Grafana [video, 4m], explains how to use Prometheus in Grafana,
- Using query builders for Grafana Loki and Prometheus [e-learning, 15m], interactively helps you use the Grafana query builders to create Prometheus queries in Grafana,
- Grafana Alerting: Explore our latest updates in Grafana 9 [blog, 6m], explains how Alerting works in Grafana,
- FlowCharting Getting Started [blog], explains how to use the FlowCharting plugin to animate draw.io drawings in Grafana.
Finally, be sure to check out the webinars and videos provided by the Grafana team.
Writing Queries
Most of the data will be queried from the Prometheus backend:
- Grafana provides a Prometheus query editor to interactively set up queries,
- The queries themselves use the PromQL syntax.
- Apart from configuring panels, you can also play with queries in the Explore tab (http://localhost:3001/explore), and directly in the Prometheus backend (http://localhost:9091).
The Prometheus database is flat, containing time-series for metrics which carry a name, labels, and a float value:
attribute_name{label="value", ...} attribute_value
For example:
device_attribute{host="dop496", station="DTS Outside", device="stat/sdp/1", name="FPGA_temp_R", x="03", y="00"} 42.3
The queries express selections on these entries for a given name, filtered by the given labels. For example, the following query returns all FPGA temperatures across all stations, including the above entry:
device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}
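To make the selection semantics concrete, here is a minimal sketch that mimics such a query over a hand-written list of entries. It only models equality matchers, and the `exampleSeries` and `selectSeries` helpers as well as the second and third entries are hypothetical illustrations, not part of Prometheus or Grafana.

```javascript
// Hypothetical in-memory copy of a few Prometheus entries.
// Only the first entry is taken from the example above; the others are made up.
function exampleSeries() {
  return [
    { name: "device_attribute",
      labels: { host: "dop496", station: "DTS Outside", device: "stat/sdp/1",
                name: "FPGA_temp_R", x: "03", y: "00" },
      value: 42.3 },
    { name: "device_attribute",
      labels: { host: "dop496", station: "DTS Outside", device: "stat/recv/1",
                name: "RCU_TEMP_R", x: "00", y: "00" },
      value: 30.1 },
    { name: "device_scraping",
      labels: { host: "dop496", device: "stat/sdp/1" },
      value: 0.2 },
  ];
}

// Keep the entries with the given metric name whose labels equal every matcher.
// Labels that are not mentioned in the matchers are not constrained.
function selectSeries(series, metricName, matchers) {
  return series.filter(s =>
    s.name === metricName &&
    Object.entries(matchers).every(([label, value]) => s.labels[label] === value));
}

// Equivalent of: device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}
const selected = selectSeries(exampleSeries(), "device_attribute",
  { device: "stat/sdp/1", name: "FPGA_temp_R" });
console.log(selected.map(s => s.value));
```

Because `host` and `station` are not part of the matchers, the same query would match the FPGA temperatures of every station.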
Furthermore, values of different metrics can be combined (added, merged, etc). See the PromQL documentation for more details, or read on.
Querying LOFAR Station Control
The LOFAR Station Control software exposes a series of metrics from each station:
device_attribute:
All monitoring points from Tango that are configured to be exposed to Prometheus. For arrays, each element is its own metric. It carries labels including host, station, device, name, x, and y, as in the example above.
device_scraping:
Time required to scrape each Tango device, in seconds. It carries the following labels:
Metrics from the non-Tango services are exposed as well. See the linked documentation, or use the interactive interfaces, to explore them further:
scrape_*:
Metrics describing scraping (Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.
node_*:
Metrics describing the server, see https://github.com/prometheus/node_exporter.
go_*, grafana_*:
Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.
Querying Operational Central Management
This software stack itself also exposes metrics from its various services:
scrape_*:
Metrics describing scraping (Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.
node_*:
Metrics describing the server, see https://github.com/prometheus/node_exporter.
go_*, grafana_*:
Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.
Query Tricks
Standard deviation between non-zero elements within an array (but not over time):
stddev(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"} > 0)
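As a rough illustration of what this query computes, the sketch below applies the same two steps to a plain array: drop the elements that fail the `> 0` filter, then take the population standard deviation of what remains (PromQL's `stddev` is the population variant). The function name `stddevOfNonZero` and the sample temperatures are made up for this example.

```javascript
// Population standard deviation over the elements that pass the "> 0" filter.
function stddevOfNonZero(values) {
  const kept = values.filter(v => v > 0);
  if (kept.length === 0) return undefined; // nothing left, so the query has no result
  const mean = kept.reduce((sum, v) => sum + v, 0) / kept.length;
  const variance = kept.reduce((sum, v) => sum + (v - mean) ** 2, 0) / kept.length;
  return Math.sqrt(variance);
}

// The zero (for example an RCU that reports no temperature) is ignored entirely.
console.log(stddevOfNonZero([20, 0, 22, 24]));
```

Without the filter, the single zero would dominate the spread and hide the variation between the working elements.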
Standard deviation for each element individually over time (but not between elements). Using $__range_interval, we ensure that the last data point in the panel covers all the data in the panel, so an Instant query, or otherwise using only the last value, is recommended:
stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval])
Standard deviation for each element over time, for elements that are non-zero over the full range. The avg_over_time(...) > bool 0 returns 1 for elements we want, and 0 for those we don't. We use the construct (mask * (result + 1) > 0) - 1 to a) filter out any result for which the mask is 0, and b) retain values for which the result is 0. The outer parentheses are needed because PromQL evaluates subtraction before comparison, so an unparenthesised "> 0 - 1" would compare against -1:
(
(avg_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) > bool 0)
* on(x,y)
(stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) + 1) > 0
) - 1
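The intended arithmetic of the mask construct can be checked per element with a small sketch. It assumes the result is never negative (true for a standard deviation) and that the mask is 0 or 1; the name `maskedValue` is hypothetical, and `undefined` stands in for "this element is dropped from the query result".

```javascript
// Per-element model of: (mask * (result + 1) > 0) - 1
function maskedValue(mask, result) {
  const shifted = mask * (result + 1); // 0 when masked out, result + 1 otherwise
  if (!(shifted > 0)) return undefined; // the "> 0" filter removes masked-out elements
  return shifted - 1;                   // undo the shift to recover the result
}

console.log(maskedValue(1, 0));   // wanted element whose result happens to be 0
console.log(maskedValue(0, 5));   // unwanted element
console.log(maskedValue(1, 2.5)); // wanted element with a non-zero result
```

The `+ 1` shift is what keeps a genuine result of 0 distinguishable from a masked-out element: without it, both would be 0 after the multiplication and the filter would drop them together.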
Masked values. If we have an attribute which is covered by some mask, we can use the following to force values outside of the mask to 0. The following formula returns 1 only for FPGAs which have a communication error, and are enabled in TR_fpga_mask_R:
device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
* on(x,y) (device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"} > bool 0)
Masked boolean values, where we also want to process values for disabled elements. If we have an attribute which has a boolean value and is covered by some mask, we can distinguish the 4 combinations by using the following formula, which results in 0 = False, disabled in mask; 1 = True, disabled in mask; 2 = False, enabled in mask; 3 = True, enabled in mask:
device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
+ on(x,y) (2 * device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"})
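The combined value is effectively a two-bit number: the low bit is the boolean attribute and the high bit is the mask. A minimal sketch of the encoding and its inverse follows; the function names are hypothetical, and both inputs are assumed to be 0 or 1.

```javascript
// Same arithmetic as the query: value + 2 * mask, with both inputs 0 or 1.
function encodeMaskedBoolean(value, mask) {
  return value + 2 * mask;
}

// Recover the two original flags from the combined value (0..3).
function decodeMaskedBoolean(combined) {
  return {
    value: combined % 2 === 1, // low bit: the boolean attribute
    enabled: combined >= 2,    // high bit: enabled in the mask
  };
}

for (const combined of [0, 1, 2, 3]) {
  console.log(combined, decodeMaskedBoolean(combined));
}
```

In a panel, the four values can then be given their own value mappings or thresholds, for example a muted colour for 1 and an alarming one for 3.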
Map Panels
Grafana provides panels to display points on a map, where both the locations of the pins and their colours are drawn from the database. We use the Orchestra Cities Map plugin for the best results.
The position information is best used in the Geohash format, which encodes latitude and longitude as a single string. The station exposes the following geohash positions as attributes:
# Position of each HBAT
device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"}
# Position of each Antenna Field
device_attribute{host="$station", device="stat/antennafield/1", name="Antenna_Field_Reference_GEOHASH_R"}
To use these in the Orchestra map, you need to configure the Data Layer as follows:
- Set Location to Geohash,
- Set Geohash field to str_value.
Furthermore, you will want to consider:
- Setting Base layer to Open Street Map,
- Setting Map view -> Initial view -> View to Auto Center.
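For readers unfamiliar with the Geohash format mentioned above, the following sketch shows the standard encoding: longitude and latitude ranges are halved alternately, and every five resulting bits are written as one base-32 character. This is an illustration of the public Geohash scheme, not the station software's own implementation; `encodeGeohash` and its `precision` parameter are names chosen for this example.

```javascript
// The Geohash base-32 alphabet (digits plus lowercase letters without a, i, l, o).
const GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";

function encodeGeohash(latitude, longitude, precision = 9) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let hash = "";
  let bits = 0;      // number of bits collected for the current character
  let index = 0;     // value of the current 5-bit group
  let useLon = true; // bits alternate between the axes, starting with longitude

  while (hash.length < precision) {
    const range = useLon ? lonRange : latRange;
    const value = useLon ? longitude : latitude;
    const mid = (range[0] + range[1]) / 2;
    if (value >= mid) {
      index = index * 2 + 1; // upper half: emit a 1 bit
      range[0] = mid;
    } else {
      index = index * 2;     // lower half: emit a 0 bit
      range[1] = mid;
    }
    useLon = !useLon;
    bits += 1;
    if (bits === 5) {
      hash += GEOHASH_ALPHABET[index];
      bits = 0;
      index = 0;
    }
  }
  return hash;
}

console.log(encodeGeohash(42.6, -5.6, 5)); // prints "ezs42"
```

Longer hashes describe smaller cells, and nearby positions usually share a common prefix, which is why a single string label is enough for the map plugin to place a marker.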
To add colours for each dot, we need to combine the position with the value of another metric, for example:
sum by (host, x) (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_PWR_on_R"})
+ on(host, x) group_right() (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"} * 0)
In which HBAT_PWR_on_R represents the power of an HBAT element, which we sum per tile over all elements using sum by (host, x). The output of this, the first line, will determine the colour. The position is added by using group_right(), which makes the result carry the labels of the right-hand side, including the critical str_value of HBAT_reference_GEOHASH_R. By using + on(...) (... * 0), we make sure the metric value is not influenced by the value of the geohash metric.
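The join can be pictured as two steps over plain arrays: aggregate the left-hand side per (host, x), then attach each sum to the matching right-hand series while keeping the right-hand labels. The sketch below models just that; the host name, geohash strings, and the function name `colourByPosition` are made-up illustrations, and the real label sets of these attributes may differ.

```javascript
// power:   one entry per element, labels {host, x, y}
// geohash: one entry per tile, labels {host, x, str_value}
function colourByPosition(power, geohash) {
  // Step 1: sum by (host, x)
  const sums = new Map();
  for (const s of power) {
    const key = s.labels.host + "|" + s.labels.x;
    sums.set(key, (sums.get(key) || 0) + s.value);
  }
  // Step 2: + on(host, x) group_right() (geohash * 0)
  const result = [];
  for (const g of geohash) {
    const key = g.labels.host + "|" + g.labels.x;
    if (!sums.has(key)) continue; // no left-hand match: no output series
    // The output keeps the right-hand labels (including str_value);
    // multiplying the geohash value by 0 leaves only the sum.
    result.push({ labels: { ...g.labels }, value: sums.get(key) + g.value * 0 });
  }
  return result;
}

const power = [
  { labels: { host: "cs001", x: "00", y: "00" }, value: 1 },
  { labels: { host: "cs001", x: "00", y: "01" }, value: 1 },
  { labels: { host: "cs001", x: "01", y: "00" }, value: 0 },
];
const geohash = [
  { labels: { host: "cs001", x: "00", str_value: "u1kuk" }, value: 0 },
  { labels: { host: "cs001", x: "01", str_value: "u1kum" }, value: 0 },
];
console.log(colourByPosition(power, geohash));
```

Each output entry has the summed power as its value and the tile position in `str_value`, which is exactly the pair the map layer needs for colour and location.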
The colour is configured in the Data Layer by setting Marker Color to Value, and configuring the Thresholds at the bottom for the actual colour to be used for each range of values.
DrawIO Panels
The FlowCharting plugin allows you to create a drawing in Draw.io, and replace values or colour elements according to query results.
See also the FlowCharting documentation.
Advanced: SVG Panels
It is possible to display an SVG picture, and have areas of it highlight based on query results. We find the ACE.SVG plugin to be useful.
An example setup can be constructed as follows:
- Add a query with the Alerta UI data source, with:
  - Rename query to AlertaAlerts,
  - URL set to http://alerta-server:8080/api/alerts,
  - Rows/Root set to alerts,
  - Add Query Param: key status, value open (to filter out any closed alerts).
- Add a query with the Grafana API data source, with:
  - Rename query to GrafanaAlerts,
  - URL set to http://localhost:3000/api/alertmanager/grafana/api/v2/alerts,
  - Rows/Root set to alerts.
- Add a query with the Prometheus data source, with:
  - Rename query to PrometheusData,
  - Query set to e.g. scrape_samples_scraped.
- Load an SVG, with at least 2 named regions,
- Under SVG Mappings, map these two regions to the names RegionA and RegionB,
- Put the content of lofar-svglib.js in User JS Init Code.
- Put the following code in User JS Render Code:

    // find the right data series
    let series = data.series.find(
      x => x.refId == "PrometheusData" && x.fields[1].labels.device == "total"
    )
    // use the last value
    let buffer = series.fields[1].values.buffer
    let lastValue = buffer[buffer.length-1]
    // colour RegionA accordingly
    svgmap.RegionA.css('fill', lastValue > 1 ? '#f00' : '#0f0')
    // link it to a fixed URL
    svgmap.RegionA.linkTo(function(link) {
      link.to('http://www.google.com').target('_blank')
    })
    // look up an alert (declared locally, so the browser's global alert() is not overwritten)
    let alert = get_alert(data, "test")
    // colour RegionB accordingly
    svgmap.RegionB.css('fill', alert.colour)
    // link it to the alert URL
    if (alert.href) {
      svgmap.RegionB.linkTo(function(link) {
        link.to(alert.href).target('_blank')
      })
    }
    console.log("refreshed")
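The render code above fails with an exception when the expected series is missing, for instance while a query has not yet returned data. A more defensive lookup could look like the sketch below. It assumes the same `data.series` layout used above (value field at index 1, values in a `buffer` array); `lastValueOf` is a hypothetical helper, not part of the plugin or of lofar-svglib.js.

```javascript
// Return the last value of the series with the given refId and labels,
// or undefined when no such series (or no data point) exists.
function lastValueOf(data, refId, labels = {}) {
  const series = (data.series || []).find(s =>
    s.refId === refId &&
    Object.entries(labels).every(
      ([key, value]) => ((s.fields[1] || {}).labels || {})[key] === value));
  if (!series) return undefined;
  const buffer = series.fields[1].values.buffer;
  return buffer.length > 0 ? buffer[buffer.length - 1] : undefined;
}

// Mocked panel data, shaped like the structure the render code above reads.
const mockData = {
  series: [{
    refId: "PrometheusData",
    fields: [
      { values: { buffer: [1000, 2000] } },                              // timestamps
      { labels: { device: "total" }, values: { buffer: [3, 7] } },       // values
    ],
  }],
};
console.log(lastValueOf(mockData, "PrometheusData", { device: "total" }));
```

The caller can then skip the colouring, or pick a neutral colour, when the function returns `undefined`.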
Alerting
Alerts on data are generated by Grafana, by periodically polling their underlying queries and triggering an alert once an alarm condition is met. These alerts are forwarded to Alerta, in which the user can manage (and annotate) them.
In Grafana, alerts can be set up in two ways:
- Attached to a panel: through the Alert tab when editing a panel,
- Free floating: through the left-hand bar under Alerting -> Alerting Rules.
Each rule consists of:
- A Query that selects the results on which to trigger,
- Expression(s) that say when the alert should fire.
Grafana is capable of generating multiple instances of the same alert, each covering one of the query results. This allows us to track different sources of the same alert individually, yet grouped. To do so, we need to retain the labels of each result as returned by the query, as each unique set of labels results in a different instance of the alert.
As the "Classical condition" Expression advised by Grafana drops all labels, we need to do something different:
- A Query A to fetch the data,
- A Reduce Expression B to select the Last (=current) values from query A,
- A Math Expression C that contains the threshold, e.g. $B > 0.5.
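The three steps above can be sketched over plain data to show why the labels survive: the reduction and the threshold are applied per series, so every label set yields its own alert instance. This is a simplified model, not Grafana's implementation; `evaluateRule` and the sample series are hypothetical.

```javascript
// queryResults models query A: one entry per label set, with its values over time.
function evaluateRule(queryResults, threshold) {
  return queryResults
    .filter(series => series.values.length > 0)
    .map(series => {
      const last = series.values[series.values.length - 1]; // Reduce B: Last
      return {
        labels: series.labels,     // labels are kept, one instance per label set
        value: last,
        firing: last > threshold,  // Math C: $B > threshold
      };
    });
}

const instances = evaluateRule([
  { labels: { host: "cs001", x: "00" }, values: [0, 0, 1] },
  { labels: { host: "cs001", x: "01" }, values: [1, 0, 0] },
], 0.5);
console.log(instances);
```

Only the first series ends above the threshold, so only its instance fires, and its labels identify which element caused the alert.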
Grafana subsequently re-evaluates the alert every given interval (10s minimum), and if the condition holds for another interval (10s minimum), the alert fires. The following details can additionally be configured to be sent along with the alert:
- Dashboard UID: which dashboard to link to in the alert,
- Panel ID: which panel to link to in the alert,
- Severity: severity of the alert: critical, major, minor, warning (default).
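The evaluation cycle described above, where a condition must hold for a further interval before the alert fires, can be modelled as a small counter. This is a simplified sketch that counts evaluations rather than wall-clock time; `createAlertState` and `pendingEvaluations` are hypothetical names, and the state names follow Grafana's Normal, Pending, and Alerting.

```javascript
// pendingEvaluations: how many further evaluations the condition must keep
// holding after it is first seen, before the alert fires.
function createAlertState(pendingEvaluations) {
  let consecutive = 0; // number of consecutive evaluations with the condition met
  return function evaluate(conditionMet) {
    consecutive = conditionMet ? consecutive + 1 : 0;
    if (consecutive === 0) return "Normal";
    return consecutive > pendingEvaluations ? "Alerting" : "Pending";
  };
}

const evaluate = createAlertState(1);
console.log(evaluate(true));  // condition seen for the first time
console.log(evaluate(true));  // condition held for another interval
console.log(evaluate(false)); // condition cleared
```

The pending step is what keeps a single noisy sample from raising an alert, at the cost of a delay of at least one interval.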
Alerta
The Alerta stack manages alerts that come from Grafana, and can be accessed through http://localhost:8082 (creds: admin/alerta). The main screen shows you an overview of the unacknowledged alerts that were generated by Grafana. Alerta allows an operator to track them using the ISA 18.2 alarm model, which has the following states per alert:
- NORM: Condition is normal: alarm is not active, all past alarms were acknowledged,
- UNACK: Alarm is active, and has not been acknowledged ("came"),
- RTNUN: Alarm came and went, but has not been acknowledged ("went"),
- SHLVD: Shelved: condition changes are ignored.
Alerts arrive in the UNACK state, and will alternate between UNACK and RTNUN until the user acknowledges the alert. Once acknowledged, the alert will not appear until it is triggered once again.
Note
Alerta will generate a message on Slack any time an alert is freshly generated (goes from NORM to UNACK).
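The state changes described in this section can be summarised as a small transition table. The sketch below is an interpretation, not Alerta's code: the event names `alarm`, `clear`, and `ack` are made up, and the ACKED state (acknowledged while still active) is an assumption taken from the wider ISA 18.2 model, since the list above does not mention it.

```javascript
// Next state for a given state and event; unknown combinations leave the state unchanged.
function nextState(state, event) {
  switch (state + ":" + event) {
    case "NORM:alarm":  return "UNACK"; // alert is freshly generated
    case "UNACK:clear": return "RTNUN"; // alarm went, still unacknowledged
    case "RTNUN:alarm": return "UNACK"; // alarm came back
    case "RTNUN:ack":   return "NORM";  // acknowledged after the alarm went
    case "UNACK:ack":   return "ACKED"; // assumption: ISA 18.2 state not listed above
    case "ACKED:clear": return "NORM";  // assumption: follows from the same model
    default:            return state;   // e.g. SHLVD ignores condition changes
  }
}

// Per the note above, Slack is only notified for freshly generated alerts.
function notifiesSlack(from, to) {
  return from === "NORM" && to === "UNACK";
}

let state = "NORM";
for (const event of ["alarm", "clear", "alarm", "clear", "ack"]) {
  const next = nextState(state, event);
  console.log(state, "->", next, notifiesSlack(state, next) ? "(Slack)" : "");
  state = next;
}
```

In this run, only the first transition produces a Slack message, even though the alarm comes and goes twice before it is acknowledged.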