Lead Image © rudall30, 123RF.com

Detect anomalies in metrics data

Jerk Detector

Article from ADMIN 70/2022
Anomalies in an environment's metrics data are an important indicator of an attack. The Prometheus time series database automatically detects, alerts, and forecasts anomalous behavior with the Fourier and Prophet models of the Prometheus Anomaly Detector.

Attacks on environments are just as much a part of the daily grind in IT as operating the infrastructure itself. The range of attacks is wide and depends on the attacker's goals. Classic denial-of-service attacks are not complex and are quite easy to detect. However, when the focus shifts to sniffing data, the methods are far more subtle, and highly complex attacks on several levels at once are no longer unusual.

As complex as the attack scenarios are, one factor remains the same: Administrators want to notice as early as possible that bad things are going on in their setups so they can react promptly. The sooner an attack is detected, the sooner it can be counteracted and the less damage it can cause.

Rigid Limits of Limited Use

The ability to detect an attack early depends on the tools available and how you use them. In the past, most admins relied on run-of-the-mill event monitoring with thresholds: If the incoming data volume exceeded a certain limit, the monitoring system sounded an alarm. If too many invalid login attempts appeared in the servers' authentication logfiles, you were notified. The focus here is on enabling you to act as quickly as possible in a specific case (i.e., conveying the current situation).

This approach is not particularly up to date or smart. Modern monitoring systems like Prometheus collect such large volumes of metrics data that it can be used to identify trends and anomalies, potentially indicating that attacks are in progress. Even distributed denial-of-service (DDoS) attacks have ceased to follow the principle of taking a server offline with as much traffic as possible in as short a time as possible. Instead, postmortem analyses of attacks regularly reveal that attackers successively increased the traffic in the weeks leading up to an attack and did so in such a way that they always flew under the radar of the thresholds in monitoring. At the decisive moment, a relatively small peak in the attack volume was the final straw that broke the servers' backs. With better trend analysis (e.g., with the help of Prometheus), such attacks become quite predictable.
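The difference between a rigid threshold and a trend analysis can be sketched in a few lines of Python. The traffic figures, the threshold of 900Mbps, and the function names below are assumptions made up for illustration; the example fits a straight line through four weeks of daily peaks and estimates when that line will cross the threshold, even though no single value has triggered an alarm yet.

```python
from statistics import mean

# Assumption: made-up daily traffic peaks in Mbps over four weeks.
STATIC_THRESHOLD = 900.0
daily_peaks = [500 + 12 * day + (15 if day % 2 else -15) for day in range(28)]


def fit_line(samples):
    """Least-squares fit; returns (slope, intercept) for x = 0, 1, 2, ..."""
    xs = range(len(samples))
    x_mean, y_mean = mean(xs), mean(samples)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_until_threshold(samples, threshold):
    """Days from the last sample until the fitted line reaches the threshold.

    Returns None if the trend is flat or falling.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(samples) - 1))


# The classic check: has any single value exceeded the limit?
static_alert = any(value > STATIC_THRESHOLD for value in daily_peaks)

# The trend check: when will the limit be reached if growth continues?
remaining = days_until_threshold(daily_peaks, STATIC_THRESHOLD)

print(f"static alert: {static_alert}")
print(f"days until threshold at current trend: {remaining:.1f}")
```

In this made-up series, the static check stays silent, whereas the trend check reports that the limit is only about a week away. Within Prometheus itself, the PromQL function predict_linear() applies the same idea to a range of samples.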

Gaussian Z-Scores

The statistical Z-score plays an important role when it comes to detecting anomalies, allowing you to define what an anomaly is in the context of a particular environment. Large infrastructures, for example, will apply far higher thresholds for DDoS than websites with only a few visits per day. From your point of view, anomaly detection now means finding a reliable mean value for individual datapoints and then defining limits within which the current measured values are allowed to deviate from the norm. The "cry wolf" effect of permanent false positives should not be underestimated. Sooner or later, no one will take a monitoring system seriously if it constantly sounds the alarm without reason. Instead of a blunt weapon, a fine scalpel comes into play when detecting anomalies in metrics data, and the Z-score is a prime example of a particularly good scalpel.

A little excursion into the world of Carl Friedrich Gauss's mathematics is unavoidable. Most people have probably heard of Gaussian normal distribution. Simply put, Gaussian theory states that, for many kinds of measured values, the extremes occur rarely and the median (i.e., the 50th percentile) occurs particularly frequently. From both ends of the x-axis, the number of matches per value increases as the median is approached. Given 100 servers, the power consumption of most devices is likely to fall around the median, with a few individual machines requiring particularly high or very low power. These values form the outer extremes of an imaginary chart with all measured datapoints.

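Percentiles generally play an important role in calculating the Z-score. The first step is to calculate the median, which is the 50th percentile (i.e., 50 percent of all measured values lie at or below this value). In a symmetrical normal distribution, the median coincides with the mean. The Z-score is used to find out how far a single datapoint deviates from this center, measured in standard deviations, and is calculated by

$$ z = \frac{x - \mu}{\sigma} $$

where x is the measured value, μ is the mean, and σ is the standard deviation (SD). The limit that the Z-score is allowed to reach before a value counts as an anomaly can be either defined with generic values or determined individually. Common limits are +/-1SD, which covers about 68 percent of the values in a normal distribution (i.e., in a dataset of 100 values, about 68 of those fall within +/-1SD of the mean), +/-2SD (95 percent), and +/-3SD (99.7 percent).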
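A minimal sketch in Python illustrates the calculation. It assumes the standard definition of the Z-score, the distance of a value from the mean divided by the standard deviation, and uses made-up power readings for 10 servers; the function and variable names are illustrative only. The second part uses the standard library's NormalDist class to check how much of a normal distribution lies within 1, 2, and 3 standard deviations.

```python
from statistics import NormalDist, mean, pstdev


def z_scores(values):
    """Return the Z-score of every value relative to the whole dataset."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing deviates.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Assumption: made-up power draw in watts for 10 servers.
power_watts = [248, 251, 250, 249, 252, 250, 247, 253, 250, 410]

scores = z_scores(power_watts)

# Flag everything more than 2 standard deviations away from the mean.
LIMIT = 2.0
outliers = [w for w, z in zip(power_watts, scores) if abs(z) > LIMIT]
print("outliers:", outliers)

# Share of a normal distribution within +/-1, 2, and 3 standard deviations.
standard_normal = NormalDist()  # mean 0, standard deviation 1
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"+/-{k}SD covers {share:.1%}")
```

Note that the outlier itself inflates both the mean and the standard deviation. With only 10 values, a single datapoint can never reach a Z-score of 3, which is why the sketch uses a limit of 2; on real metrics, a far longer history gives more reliable limits.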

Red Hat with Groundwork

Now the question arises as to what you need to do to generate appropriate alerts from your metrics data, which is a practical possibility only if you use a time series database (e.g., Prometheus). Prometheus collects the metric values from "exporters" on the target systems and stores them centrally. This data can be evaluated with Prometheus's own query language, PromQL, and Grafana can display the Prometheus data graphically. Prometheus generates alerts through its Alertmanager component if individual metrics assume certain values or are outside of defined limits.
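As a rough illustration of how a Z-score could be evaluated directly in Prometheus, the following Python sketch builds a PromQL expression from the functions avg_over_time() and stddev_over_time() and prepares a request to the instant-query endpoint /api/v1/query of the Prometheus HTTP API. The server address, the node_load1 metric, the one-day window, and all helper names are assumptions for this example, not part of any particular setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Prometheus server reachable on its default local port.
PROMETHEUS_URL = "http://localhost:9090"


def zscore_expr(metric, window="1d"):
    """PromQL expression for the Z-score of a metric over a time window."""
    return (
        f"({metric} - avg_over_time({metric}[{window}]))"
        f" / stddev_over_time({metric}[{window}])"
    )


def query_url(base_url, expr):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_vector(payload):
    """Extract (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        (series["metric"], float(series["value"][1]))
        for series in payload["data"]["result"]
    ]


def fetch_zscores(metric, window="1d", base_url=PROMETHEUS_URL):
    """Run the query; needs a reachable Prometheus server."""
    with urlopen(query_url(base_url, zscore_expr(metric, window))) as response:
        return parse_vector(json.load(response))


# node_load1 is a metric exposed by the node exporter; any gauge would do.
print(zscore_expr("node_load1"))
```

Calling fetch_zscores("node_load1") against a running server would return one Z-score per time series, which you could then compare against a limit such as 2 or 3.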

The answer to the question of how to use an existing Prometheus installation for effective anomaly detection is provided by Red Hat. The Prometheus Anomaly Detector (PAD) comprises several components designed to detect anomalies from historic data on the one hand and machine learning and projection on the other.

A look under the hood shows the combination of components. Red Hat generally assumes that you will roll out PAD as a component in OpenShift, although this scenario is not enforced. However, if you want to use PAD, you will need an environment comprising Prometheus, Alertmanager, and Thanos – more about this in a moment – because PAD does not seek to be a monitoring tool itself, but to dock onto existing setups. At its core, PAD is a Python application that applies two models from the field of artificial intelligence (AI) and machine learning (ML): Fourier and Prophet.
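To give an impression of what a Fourier-based forecast does in principle, the following sketch breaks a periodic metric down into its strongest frequency components, extrapolates them into the future, and flags observed values that stray too far from the forecast. This is not PAD's implementation; it is a deliberately simple pure-Python illustration with made-up data, and the function names, the number of harmonics, and the tolerance are all assumptions.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (fine for short series)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_forecast(values, horizon, harmonics=3):
    """Extrapolate the strongest periodic components beyond the history."""
    n = len(values)
    center = sum(values) / n
    spectrum = dft([v - center for v in values])
    # Rank the positive-frequency bins by amplitude and keep the strongest.
    strongest = sorted(
        range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True
    )[:harmonics]
    forecast = []
    for t in range(n, n + horizon):
        value = center
        for k in strongest:
            # Each bin has a mirrored twin, except the Nyquist bin for even n.
            weight = 1 if (n % 2 == 0 and k == n // 2) else 2
            amplitude = weight * abs(spectrum[k]) / n
            phase = cmath.phase(spectrum[k])
            value += amplitude * math.cos(2 * math.pi * k * t / n + phase)
        forecast.append(value)
    return forecast


def flag_anomalies(observed, predicted, tolerance):
    """Indices at which the observation leaves the band around the forecast."""
    return [
        i for i, (o, p) in enumerate(zip(observed, predicted))
        if abs(o - p) > tolerance
    ]


# Assumption: two made-up days of hourly load with a clean 24-hour cycle.
history = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(48)]
predicted = fourier_forecast(history, horizon=24, harmonics=1)

# The third day follows the same cycle, except for a spike in hour 5.
observed = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(48, 72)]
observed[5] += 4

print("anomalous hours:", flag_anomalies(observed, predicted, tolerance=1.0))
```

A real implementation would more likely rely on a fast Fourier transform from a numerical library and derive the width of the tolerance band from the data instead of hard-coding it.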
