User Manual
The Grafana system exposed on http://localhost:3001 visualises the monitoring information collected by Prometheus (and other sources). It is organised as follows; see the Grafana documentation for details on each concept:
- A series of dashboards, organised into folders. Each dashboard is an independent page of visualisations. When you log in, you will see the configured "Home" dashboard.
- Each dashboard has a series of panels, often organised into collapsible rows. Each panel contains a specific visualisation, and can have alerts configured on it. The panels are tiled.
- Each panel has a set of queries, which describe the data to be visualised, and a single visualisation, which determines how the data is displayed.
The Grafana documentation will help you with using Grafana in general. Also be sure to check out the webinars and videos provided by Grafana.
Writing Queries
Most of the data will be queried from the Prometheus backend:
- Grafana provides a Prometheus query editor to set up queries interactively.
- The queries themselves use the PromQL syntax.
- Apart from configuring panels, you can also experiment with queries in the Explore tab (http://localhost:3001/explore), and directly in the Prometheus backend (http://localhost:9091).
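Besides the web interfaces, a PromQL expression can also be sent to Prometheus from a script. The following is a minimal sketch that only builds the request URL for an instant query. It assumes that the backend on http://localhost:9091 serves the standard Prometheus HTTP API under /api/v1/query; the helper name build_query_url is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the Prometheus backend from this manual serves the
# standard HTTP API on the same address as its web interface.
PROMETHEUS_URL = "http://localhost:9091"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Return the URL of an instant query for the given PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


url = build_query_url("scrape_samples_scraped")
print(url)

# The expression survives the URL encoding unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Fetching this URL, for example with urllib.request.urlopen, should return a JSON document with the matching series, provided the server is reachable from where the script runs.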
The Prometheus database is flat, containing time-series for metrics which carry a name, labels, and a float value:
attribute_name{label="value", ...} attribute_value
For example:
device_attribute{host="dop496", station="DTS Outside", device="stat/sdp/1", name="FPGA_temp_R", x="03", y="00"} 42.3
The queries express selections on these entries for a given name, filtered by the given labels. For example, the following query returns all FPGA temperatures across all stations, including the above entry:
device_attribute{device="stat/sdp/1", name="FPGA_temp_R"}
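To illustrate how such a selection behaves, the following sketch mimics it in Python on a small in-memory list. It is not how Prometheus is implemented. The second and third entries, and the helper name select, are made up for this example.

```python
# Hypothetical sample of the flat database: (metric name, labels, value).
SERIES = [
    ("device_attribute",
     {"host": "dop496", "device": "stat/sdp/1", "name": "FPGA_temp_R", "x": "03", "y": "00"},
     42.3),
    ("device_attribute",
     {"host": "dop496", "device": "stat/sdp/1", "name": "FPGA_temp_R", "x": "04", "y": "00"},
     41.8),
    ("device_attribute",
     {"host": "dop496", "device": "stat/recv/1", "name": "RCU_TEMP_R", "x": "00", "y": "00"},
     35.0),
]


def select(series, metric, **wanted):
    """Mimic metric{label="value", ...}: keep entries whose metric name
    matches and whose labels equal every requested value. Labels that
    are not mentioned in the selection are ignored."""
    return [
        (name, labels, value)
        for name, labels, value in series
        if name == metric
        and all(labels.get(key) == val for key, val in wanted.items())
    ]


matches = select(SERIES, "device_attribute", device="stat/sdp/1", name="FPGA_temp_R")
for _, labels, value in matches:
    print(labels["x"], labels["y"], value)
```

Because the selection does not mention host, x, or y, entries with any value for those labels are returned.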
Furthermore, values of different metrics can be combined (added, merged, etc.). See the PromQL documentation for more details, or read on:
Querying LOFAR Station Control
The LOFAR Station Control software exposes a series of metrics from each station:
device_attribute:
    All monitoring points from Tango that are configured to be exposed to Prometheus. For arrays, each element is its own metric. It carries labels such as those in the example above (host, station, device, name, x, y).
device_scraping:
    Time required to scrape each Tango device, in seconds.
Metrics from the non-Tango services are exposed as well. See the linked documentation, or use the interactive interfaces, to explore them further:
scrape_*:
    Metrics describing scraping (that is, Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.
node_*:
    Metrics describing the server, see https://github.com/prometheus/node_exporter.
go_*, grafana_*:
    Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.
Querying Operational Central Management
This software stack itself also exposes metrics from its various services:
scrape_*:
    Metrics describing scraping (that is, Prometheus periodically requesting the metrics), see https://prometheus.io/docs/concepts/jobs_instances/.
node_*:
    Metrics describing the server, see https://github.com/prometheus/node_exporter.
go_*, grafana_*:
    Metrics from Grafana, see https://grafana.com/docs/grafana/latest/administration/view-server/internal-metrics/ and https://grafana.com/docs/grafana/latest/alerting/unified-alerting/fundamentals/evaluate-grafana-alerts/.
Query Tricks
Standard deviation between non-zero elements within an array (but not over time):
stddev(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"} > 0)
Standard deviation for each element individually over time (but not between elements). By using $__range_interval, we ensure that the last data point in the panel covers all the data in the panel. It is therefore recommended to use an Instant query, or otherwise to use only the last value:
stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval])
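The difference between the two queries can be sketched in Python on a small table of samples. The numbers are made up, and the sketch relies on PromQL's stddev and stddev_over_time both computing the population standard deviation, which is why statistics.pstdev is used.

```python
from statistics import pstdev

# Hypothetical samples of one array attribute: one row per scrape,
# one column per array element. The third element is off and reads 0.
samples = [
    [30.0, 32.0, 0.0, 34.0],
    [31.0, 32.0, 0.0, 36.0],
    [32.0, 32.0, 0.0, 38.0],
]

# stddev(... > 0): spread between the non-zero elements of the latest scrape.
latest = samples[-1]
between_elements = pstdev([value for value in latest if value > 0])

# stddev_over_time(...[range]): spread of each element over the whole range.
over_time = [pstdev(column) for column in zip(*samples)]

print(between_elements)
print(over_time)
```

The first result is a single number for the whole array, while the second keeps one number per element.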
Standard deviation for each element over time, for elements that are non-zero over the full range. The avg_over_time(...) > bool 0 returns 1 for elements we want, and 0 for those we don't. We use the construct (mask * (result + 1) > 0) - 1 to a) filter out any result for which the mask is 0, and b) retain values for which the result is 0. The outer parentheses are needed because PromQL gives - a higher precedence than >, so that > 0 - 1 on its own would be read as > -1:
(
(avg_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) > bool 0)
* on(x,y)
(stddev_over_time(device_attribute{host="$station", device="stat/recv/1", name="RCU_TEMP_R"}[$__range_interval]) + 1) > 0
)
- 1
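The arithmetic behind this construct (multiply by the mask, filter on > 0, then subtract 1) can be checked per element with a small sketch. It assumes the result is never negative, which holds for a standard deviation. The function name masked and the sample values are hypothetical.

```python
def masked(mask, result):
    """Apply the mask construct to one element.

    Returns None when the element is filtered out, like a PromQL
    comparison without "bool", which drops non-matching samples.
    """
    shifted = mask * (result + 1)  # 0 if masked out, otherwise >= 1
    if not shifted > 0:            # "> 0" drops the masked-out elements
        return None
    return shifted - 1             # undo the shift, restoring the result


# Hypothetical (mask, stddev) pairs for four elements.
elements = [(1, 0.0), (1, 2.5), (0, 0.0), (0, 1.5)]
with_shift = [masked(m, r) for m, r in elements]
print(with_shift)

# Without the +1/-1 shift, a genuine result of 0 would be filtered as well.
naive = [m * r if m * r > 0 else None for m, r in elements]
print(naive)
```

The first element shows the point of the shift: it is enabled and has a result of 0, and survives only in the shifted version.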
Masked values. If we have an attribute which is covered by some mask, we can use the following to force values outside of the mask to 0. The following formula returns 1 only for FPGAs which have a communication error and are enabled in TR_fpga_mask_R:
device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
* on(x,y) (device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"} > bool 0)
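The on(x,y) part pairs up the entries of both attributes that share the same x and y labels before multiplying them. A rough Python sketch of that one-to-one matching, using made-up values for three FPGAs and a hypothetical helper multiply_on:

```python
def multiply_on(left, right, on):
    """Mimic left * on(...) right for one-to-one matching: pair series
    that agree on the listed labels and multiply their values."""
    def key(labels):
        return tuple(labels[name] for name in on)

    lookup = {key(labels): value for labels, value in right}
    return [
        (key(labels), value * lookup[key(labels)])
        for labels, value in left
        if key(labels) in lookup
    ]


# Hypothetical values for three FPGAs, each a (labels, value) pair.
errors = [
    ({"x": "00", "y": "00"}, 1),
    ({"x": "01", "y": "00"}, 1),
    ({"x": "02", "y": "00"}, 0),
]
mask = [
    ({"x": "00", "y": "00"}, 1),
    ({"x": "01", "y": "00"}, 0),
    ({"x": "02", "y": "00"}, 1),
]

# "> bool 0" turns the mask into 1 (enabled) or 0 (disabled).
enabled = [(labels, int(value > 0)) for labels, value in mask]

masked_errors = multiply_on(errors, enabled, on=("x", "y"))
print(masked_errors)
```

Only the first FPGA ends up as 1: the second has an error but is disabled in the mask, and the third is enabled but has no error.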
Masked boolean values, where we also want to process values for disabled elements. If we have an attribute which has a boolean value and is covered by some mask, we can distinguish the 4 combinations using the following formula, which results in 0 = False, disabled in mask; 1 = True, disabled in mask; 2 = False, enabled in mask; 3 = True, enabled in mask:
device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_communication_error_R"}
+ on(x,y) (2 * device_attribute{host="$station",device="stat/sdp/1",name="TR_fpga_mask_R"})
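The encoding treats the value and the mask as the two bits of one number. A short sketch of the arithmetic, with hypothetical helper names combine and split:

```python
# Meaning of each number produced by: value + 2 * mask.
STATES = {
    0: "False, disabled in mask",
    1: "True, disabled in mask",
    2: "False, enabled in mask",
    3: "True, enabled in mask",
}


def combine(value, mask):
    """Encode a boolean value (0/1) and its mask bit (0/1) as one number."""
    return value + 2 * mask


def split(combined):
    """Recover (value, mask) from the combined number."""
    return combined % 2, combined // 2


for mask in (0, 1):
    for value in (0, 1):
        code = combine(value, mask)
        print(code, STATES[code])
```

In a panel, the four numbers can then be given their own colour or text through value mappings.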
Map Panels
Grafana provides panels to display points on a map, where both the locations of the pins and their colours are drawn from the database. We use the Orchestra Cities Map plugin for the best results.
The position information is best used in the Geohash format, which encodes latitude and longitude as a single string. The station exposes the following geohash positions as attributes:
# Position of each HBAT
device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"}
# Position of each Antenna Field
device_attribute{host="$station", device="stat/antennafield/1", name="Antenna_Field_Reference_GEOHASH_R"}
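For reference, a geohash string can be turned back into coordinates with a few lines of Python. This is a minimal sketch of the standard Geohash decoding, assuming the attributes use that standard encoding. The string "ezs42" is a commonly used textbook example, not a station position.

```python
# Standard Geohash base-32 alphabet (no "a", "i", "l" or "o").
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_decode(geohash):
    """Return the (latitude, longitude) at the centre of a geohash cell."""
    lat = [-90.0, 90.0]
    lon = [-180.0, 180.0]
    use_lon = True  # bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # five bits per character, MSB first
            bit = (value >> shift) & 1
            interval = lon if use_lon else lat
            middle = (interval[0] + interval[1]) / 2
            # A 1 bit selects the upper half of the interval, a 0 bit the lower.
            interval[0 if bit else 1] = middle
            use_lon = not use_lon
    return (lat[0] + lat[1]) / 2, (lon[0] + lon[1]) / 2


latitude, longitude = geohash_decode("ezs42")
print(round(latitude, 3), round(longitude, 3))
```

Each extra character narrows the cell further, so longer strings give more precise positions.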
To use these in the Orchestra map, you need to configure the Data Layer as follows:
- Set Location to Geohash,
- Set Geohash field to str_value.
Furthermore, you will want to consider:
- Setting Base layer to Open Street Map,
- Setting Map view -> Initial view -> View to Auto Center.
To add colours for each dot, we need to combine the position with the value of another metric, for example:
sum by (host, x) (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_PWR_on_R"})
+ on(host, x) group_right() (device_attribute{host="$station", device="stat/antennafield/1", name="HBAT_reference_GEOHASH_R"} * 0)
Here, HBAT_PWR_on_R represents the power of an HBAT element, which we sum per tile over all elements using sum by (host, x). The output of this first line determines the colour. The position is added by using group_right, which adds the critical str_value label of HBAT_reference_GEOHASH_R. By using + on(...) (... * 0), we make sure the metric value is not influenced by the value of the geohash metric.
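The effect of this query can be mimicked in Python to see which side contributes what. This is only a sketch: the host name, geohash strings and values below are made up, and the value of the geohash metric is arbitrary because it is multiplied by 0.

```python
from collections import defaultdict

# Hypothetical samples: (labels, value). Each tile (x) has elements (y).
power = [
    ({"host": "station1", "x": "00", "y": "00"}, 1),
    ({"host": "station1", "x": "00", "y": "01"}, 1),
    ({"host": "station1", "x": "01", "y": "00"}, 0),
    ({"host": "station1", "x": "01", "y": "01"}, 1),
]
# The geohash series carries its position in the str_value label.
geohash = [
    ({"host": "station1", "x": "00", "str_value": "u1kxyz"}, 1.0),
    ({"host": "station1", "x": "01", "str_value": "u1kxzb"}, 1.0),
]

# sum by (host, x): add up all elements of each tile.
per_tile = defaultdict(float)
for labels, value in power:
    per_tile[(labels["host"], labels["x"])] += value

# + on(host, x) group_right() (geohash * 0): keep the labels of the
# right-hand series, but only the value of the left-hand side.
combined = [
    (labels, per_tile[(labels["host"], labels["x"])] + value * 0)
    for labels, value in geohash
    if (labels["host"], labels["x"]) in per_tile
]

for labels, value in combined:
    print(labels["str_value"], value)
```

Each output entry has the summed power as its value and the str_value label of the geohash series, which is the combination the map layer needs.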
The colour is configured in the Data Layer by setting Marker Color to Value, and configuring the Thresholds at the bottom for the actual colour to be used for each range of values.
SVG Panels
It is possible to display an SVG picture, and have areas of it highlight based on query results. We find the ACE.SVG plugin to be useful.
An example setup can be constructed as follows:
- Add a query with the Alerta UI data source, with:
  - Rename query to AlertaAlerts,
  - URL set to http://alerta-server:8080/api/alerts,
  - Rows/Root set to alerts,
  - Add Query Param: key status, value open (to filter out any closed alerts).
- Add a query with the Grafana API data source, with:
  - Rename query to GrafanaAlerts,
  - URL set to http://localhost:3000/api/alertmanager/grafana/api/v2/alerts,
  - Rows/Root set to alerts.
- Add a query with the Prometheus data source, with:
  - Rename query to PrometheusData,
  - Query set to, for example, scrape_samples_scraped.
- Load an SVG with at least 2 named regions.
- Under SVG Mappings, map these two regions to the names RegionA and RegionB.
- Put the content of lofar-svglib.js in User JS Init Code.
- Put the following code in User JS Render Code:

    // find the right data series
    let series = data.series.find(
      x => x.refId == "PrometheusData" && x.fields[1].labels.device == "total"
    )

    // use the last value
    let buffer = series.fields[1].values.buffer
    let lastValue = buffer[buffer.length-1]

    // colour RegionA accordingly
    svgmap.RegionA.css('fill', lastValue > 1 ? '#f00' : '#0f0')

    // link it to a fixed URL
    svgmap.RegionA.linkTo(function(link) {
      link.to('http://www.google.com').target('_blank')
    })

    // look up an alert (declared with let to avoid overwriting the global alert function)
    let alert = get_alert(data, "test")

    // colour RegionB accordingly
    svgmap.RegionB.css('fill', alert.colour)

    // link it to the alert URL
    if (alert.href) {
      svgmap.RegionB.linkTo(function(link) {
        link.to(alert.href).target('_blank')
      })
    }

    console.log("refreshed")