Storage monitoring with Grafana

Painting by Numbers

Reading Complete SNMP Tables

SNMP organizes various items of system information in tables, which is quite practical for my purposes because Telegraf can retrieve complete SNMP tables in a single action. With a networked storage system, administrators naturally want to know exactly how many gigabytes are coming in and going out over the network interfaces. Telegraf therefore collects the complete network table:

[[inputs.snmp.table]]
  name = "if"
  inherit_tags = [ "hostname" ]
  oid = "IF-MIB::ifXTable"
  [[inputs.snmp.table.field]]
    name = "ifName"
    oid = "IF-MIB::ifName"
    is_tag = true

The ifName field is a table index, which makes it easy later to display the values of the various network interfaces separately. This example could also be used to monitor a managed network switch. The IF-MIB then lists all switch ports and their loads, and it works for Fibre Channel switches, as well. The input

snmpwalk -v 2c -c public <NAS address> IF-MIB::ifXTable

shows the complete table content, including the names of the interfaces and many different counters for packets sent and received.
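The idea of a table index can be sketched in a few lines of Python. The sample lines below are made up in the usual net-snmp output style (your interface names and counters will differ), and the helper names parse_table and by_name are purely illustrative. The sketch groups the walk output by the numeric index and then re-keys the rows by ifName, which is roughly what marking ifName as a tag does in Telegraf.

```python
import re
from collections import defaultdict

# Hypothetical sample lines in net-snmp output style; real values will differ.
SAMPLE = """\
IF-MIB::ifName.1 = STRING: lo
IF-MIB::ifName.2 = STRING: eth0
IF-MIB::ifName.3 = STRING: eth1
IF-MIB::ifHCInOctets.1 = Counter64: 1200
IF-MIB::ifHCInOctets.2 = Counter64: 0
IF-MIB::ifHCInOctets.3 = Counter64: 987654321
"""

# Matches lines of the form 'MIB::column.index = TYPE: value'.
LINE = re.compile(
    r"^(?P<mib>[\w-]+)::(?P<column>\w+)\.(?P<index>\d+)"
    r" = (?P<type>\w+): (?P<value>.*)$"
)


def parse_table(text):
    """Group walk output into rows keyed by the numeric table index."""
    rows = defaultdict(dict)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a column.index line
        value = match["value"]
        if match["type"].startswith("Counter"):
            value = int(value)  # counters become integers
        rows[int(match["index"])][match["column"]] = value
    return dict(rows)


def by_name(rows):
    """Re-key the rows by interface name instead of numeric index."""
    return {row["ifName"]: row for row in rows.values() if "ifName" in row}


table = by_name(parse_table(SAMPLE))
print(table["eth1"]["ifHCInOctets"])  # 987654321
```

Keyed this way, every counter of an interface can be addressed by its name, which is what the later Grafana filters rely on.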

Following the same pattern, Telegraf also imports the disk I/O values into InfluxDB. They, too, are organized in a standard MIB table, and the device name later acts as the index. The input/output operations per second (IOPS) are far more interesting here than the throughput (MBps) per disk. Bandwidth bottlenecks in sequential data transfer are primarily caused by the network connection. The disks, on the other hand, have limited IOPS and cause problems when many small random accesses occur, such as database queries or simultaneous access by different clients. The entries are thus:
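A quick back-of-the-envelope calculation shows why IOPS matter more than MBps for small random accesses. The figure of 150 IOPS and the request sizes below are assumptions chosen only for illustration, not measurements from the NAS described here.

```python
def throughput_mb_per_s(iops, request_size_bytes):
    """Data volume moved per second for a given number of operations."""
    return iops * request_size_bytes / 1_000_000


# Assumed values for illustration only, not measured on any real disk.
ASSUMED_IOPS = 150

small = throughput_mb_per_s(ASSUMED_IOPS, 4 * 1024)     # 4KiB random requests
large = throughput_mb_per_s(ASSUMED_IOPS, 1024 * 1024)  # 1MiB requests

print(f"4KiB requests: {small:.2f} MB/s")
print(f"1MiB requests: {large:.2f} MB/s")
```

With the same number of operations per second, small requests move only a fraction of a megabyte, so a disk can be saturated by random access long before the throughput graph looks busy.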

[[inputs.snmp.table]]
  name = "diskio"
  inherit_tags = [ "hostname" ]
  oid = "UCD-DISKIO-MIB::diskIOTable"
  [[inputs.snmp.table.field]]
    name = "DiskName"
    oid = "UCD-DISKIO-MIB::diskIODevice"
    is_tag = true

Telegraf retrieves information on disk allocation and system load from two further standard SNMP tables (Listing 3).

Listing 3

Disk and Load Requests

[[inputs.snmp.table]]
  name = "diskusage"
  inherit_tags = [ "hostname" ]
  oid = "HOST-RESOURCES-MIB::hrStorageTable"
  [[inputs.snmp.table.field]]
    name = "VolumeName"
    oid = "HOST-RESOURCES-MIB::hrStorageDescr"
    is_tag = true
[[inputs.snmp.table]]
  name = "load"
  inherit_tags = [ "hostname" ]
  oid = "UCD-SNMP-MIB::laTable"
  [[inputs.snmp.table.field]]
    name = "loadtime"
    oid = "UCD-SNMP-MIB::laNames"
    is_tag = true

For the moment, this configuration collects all the values needed for visualization in Grafana. Because InfluxDB does not require a rigid database structure, you can add more tables or single values to the configuration later on. Additionally, you can use this configuration, as mentioned above, to query data from several systems. The

inherit_tags = [ "hostname" ]

entry copies the hostname tag to every table row, so InfluxDB queries can select values by system, but more about this later.

To check whether the Telegraf configuration actually works, first issue the telegraf --test command. The tool then parses the configuration, executes the queries, and displays the results at the command line. You can check whether the results suit your needs and, if not, change the queries. If everything is fine, enter

systemctl restart telegraf
systemctl enable telegraf

to start the service and deliver fresh metrics to InfluxDB every 30 seconds.

Creating Custom Dashboards

Grafana works on a simple principle: data sources that supply information on the one hand and visualizations that display data from those sources on the other. Grafana combines several visualizations into dashboards and also has a simple user and rights system, so you can restrict access to dashboards to individual groups and users. Here, however, I will not be looking at access controls.

A newly installed Grafana first requires a new password and the first data source. In this case, InfluxDB is on http://localhost:8086, Access: Server (Default) with the telegraf database, which does not require a username and password. I quickly create a new dashboard named Synology and get started with an initial visualization task showing the network traffic (Figure 1).

Figure 1: Overview of NAS performance. The dashboard displays disk and network I/O, system load, and storage system utilization level.

On the dashboard, the Add Panel button at the top starts the dialog for the new visualization, which first asks for information about the query. Grafana does not require manual input; rather, it relies on point-and-click in the Query Builder, which greatly simplifies even the more complex database queries. The query starts with the FROM statement. The first and only data source is also the default. The table was simply named if in the Telegraf configuration set up earlier, and the WHERE selection filters for host and interface names. The NAS in the test goes by the name fatbox and has two LAN ports, of which only eth1 is attached to the switch. The selection is therefore:

FROM default if WHERE ifName = eth1 AND hostname = fatbox

For the SELECT statements, Grafana now only suggests the fields that match the FROM filter criteria. SNMP does not provide values in megabits per second, but simply counts the incoming and outgoing network octets (bytes) in 64-bit counters. The following selection is required for a value in bits per second:

SELECT field(ifHCInOctets) mean() derivative(1s) math(*8) alias(IN)

The ifHCInOctets field is a 64-bit counter that returns the number of incoming octets; the derivative(1s) function calculates its rate of change per second. Because new values arrive only every 30 seconds, mean() averages the data points that fall into each time interval, and math(*8) converts the octets (i.e., bytes) per second into a bits per second value. The alias(IN) is only used for cosmetic reasons so that the legend for the graph reads if.IN.
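The arithmetic behind derivative(1s) and math(*8) can be sketched in Python. The two counter readings are hypothetical samples taken 30 seconds apart, and the function name bits_per_second is only illustrative; InfluxDB does this work inside the query.

```python
def bits_per_second(prev_octets, curr_octets, interval_s):
    """Per-second rate of change of an octet counter, converted to bits."""
    octets_per_second = (curr_octets - prev_octets) / interval_s
    return octets_per_second * 8  # 1 octet = 8 bits


# Two hypothetical ifHCInOctets readings, 30 seconds apart.
rate = bits_per_second(prev_octets=1_000_000, curr_octets=4_750_000, interval_s=30)
print(rate)  # 1000000.0, i.e., 1Mbps
```

Multiplying by -8 instead of 8 yields the same magnitude with a negative sign, which is all that is needed to draw a graph downward.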

To display the OUT value, as well, simply click on the + at the end of the query and scroll to fields/field. Grafana then duplicates the existing query into a second SELECT query. This second line is then assigned the (ifHCOutOctets) field and the alias (OUT):

SELECT field(ifHCOutOctets) mean() derivative(1s) math(*8) alias(OUT)

Now the visualization will show the incoming and outgoing network traffic, but the two graphs overlap. To make this a little more clear cut, simply change the math(*8) entry of the OUT graph to math(*-8). Grafana now draws the OUT traffic as a negative value in the downward direction, which is far more intuitive.

For up-to-date values at all times, you can set the displayed time span and refresh interval in the upper right corner of the dashboard. In this early phase, Last 1 hour Refresh every 30s is recommended.

Grafana shows the section icons for further graph configuration to the left of the query. From the Visualization icon, you can set the graph type and the display options. The default values will normally be fine for line graphs. In the Axes | Left Y section, you can define the measurement Unit; in this example, it is Data Rate – bit/s. The General tab has a field for the visualization name, which is then saved by pressing the Save Dashboard icon at the top of the screen.

Refining the Display

Using the same procedure, you can create a second panel with the disk IOPS (Figure 2). You can choose to monitor the values of all physical disks separately (i.e., sda, sdb, sdc). However, the monitoring example here only considers the multidisk device dm-1 (i.e., the software RAID that the Synology NAS has created from the disks). Depending on the configuration of your network storage, completely different device names will appear here.

Figure 2: Grafana's graphical query tool builds InfluxDB queries and displays the results immediately.

Those with iSCSI target services and a block back end will find their iSCSI target devices listed separately as dm-2, dm-3, and so on. iSCSI targets in file mode, on the other hand, save the virtual disk as a file, whose IOPS appear as part of dm-1 and cannot be monitored separately. The query for IOPS follows the same pattern as the network query:

FROM default diskio WHERE DiskName = dm-1 AND hostname = fatbox
SELECT field(diskIOReads) mean() derivative(1s) math(*-1) alias(ReadIO)

Because of the math(*-1) entry, the read value extends in the downward direction, so that it visually mirrors the OUT graph of the network panel. The second graph, similar to the first, uses the diskIOWrites field and alias(WriteIO) and omits the math() entry. Now just fine tune the appearance, and the second panel is done and dusted.

The last item of the Graph section lets you create alerts. Grafana can then send alerts over various channels if the monitored values drop below or climb above a certain value over a defined period of time. This is not necessary for I/O values. However, values such as temperatures, fan speeds, or UPS battery status can also be retrieved with this SNMP setup, and for those the alert function makes sense. In addition to good old email, Grafana can address messengers such as Slack, Telegram, and Discord through configurable notification channels.
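The principle behind such a rule can be sketched in Python. This is a deliberately simplified illustration and not Grafana's implementation: the function should_alert, the temperature readings, and the thresholds are all hypothetical.

```python
from statistics import mean


def should_alert(values, threshold, above=True):
    """Compare the average of a time window against a threshold.

    above=True  -> alert when the average climbs above the threshold
    above=False -> alert when the average drops below the threshold
    """
    if not values:
        return False  # no data in the window, nothing to evaluate
    average = mean(values)
    return average > threshold if above else average < threshold


# Hypothetical temperature readings (degrees Celsius) from the last few minutes.
temperatures = [41, 43, 47, 52, 55]

print(should_alert(temperatures, threshold=45))               # True
print(should_alert(temperatures, threshold=50))               # False
print(should_alert(temperatures, threshold=45, above=False))  # False
```

Averaging over a window rather than reacting to a single sample keeps short spikes from triggering a notification.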

To display the fill level of the NAS as a percentage, you need a more complex query and a nice Singlestat Panel (Figure 3). InfluxDB can perform mathematical calculations in the queries. For the fill percentage of the NAS /volume1 filesystem, you need to divide the SNMP hrStorageUsed value from hrStorageTable by the total capacity hrStorageSize and multiply the result by 100.
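The calculation itself is simple, as the following Python sketch shows. The two values are hypothetical. HOST-RESOURCES-MIB reports both hrStorageUsed and hrStorageSize in allocation units (hrStorageAllocationUnits), so the unit cancels out in the division and no conversion to bytes is needed.

```python
def fill_percent(used_units, size_units):
    """Fill level in percent; both arguments are in the same allocation units."""
    if size_units <= 0:
        raise ValueError("size must be positive")
    return used_units / size_units * 100


# Hypothetical hrStorageUsed and hrStorageSize values for /volume1.
print(fill_percent(used_units=1_500_000, size_units=2_000_000))  # 75.0
```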

Figure 3: NAS fill in a Singlestat Panel.

For queries that Grafana cannot create in the user interface, you can first create a simpler query with the graphical tools (e.g., only for hrStorageUsed) and then switch from the graphical to the text query by pressing the eye icon. After you extend the text manually, the query reads:

SELECT mean("hrStorageUsed")/mean("hrStorageSize")*100 FROM "diskusage" WHERE ("VolumeName"='/volume1') AND ("hostname"='fatbox') AND $timeFilter GROUP BY time($__interval) fill(previous)

Unlike the user interface query, the text query displays the required InfluxDB syntax with parentheses and quotes. You can check immediately whether your manual edits to the query actually work with the Query Inspector, which displays the complete output of a query onscreen, including all error messages.

To display this query appropriately, select the Singlestat option in the Visualization section. To make it pretty, set the following values in the Value pane: Stat: current; Unit: percent (0-100); Thresholds: 50,80 with the colors green, yellow, and red; and Gauge: Show, with Threshold Markers checked. The thresholds change the color of the graph when the specified values are exceeded, so this graph will show red as of 80 percent occupancy of the volume. The Stat: current entry forces the display to show the latest acquired value. If the Stat: avg default is kept, the graph shows the average value over the period selected top right.
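How the two thresholds map a value to one of three colors can be sketched in Python. The function gauge_color is a hypothetical illustration of the principle, not Grafana code.

```python
def gauge_color(value, thresholds=(50, 80), colors=("green", "yellow", "red")):
    """Return the color of the first band whose upper limit the value is below."""
    for limit, color in zip(thresholds, colors):
        if value < limit:
            return color
    return colors[-1]  # at or above the last threshold


for level in (12.5, 50, 79.9, 80, 97):
    print(level, gauge_color(level))
```

With thresholds of 50 and 80, values below 50 appear green, values from 50 up to just under 80 appear yellow, and anything from 80 upward appears red.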
