Anticipating Problems with the Second Way Part 2

Alerts for Undesired Outcomes

Even with appropriate metrics, teams might not realise that problems have occurred unless someone is always watching the values. Having someone watch every metric constantly is nearly impossible, especially in large organisations with huge development pipelines, where the number of individual values and metrics can be staggering. Fortunately, there are ways to inform teams that something has gone awry. Organisations can set up alerts that automatically notify key team members when metrics fall outside acceptable ranges.

The exact details of these alerts vary between organisations, but many use tiers of alerts that notify more team members as problems escalate. For example, a value more than one standard deviation from the average may show a notification on a dashboard. When a value is more than two standard deviations from the average, there may be an email to team management. A value more than three standard deviations from the average may generate an automated call or SMS to on-call team members. As with many other parts of DevOps, the exact details of alerts are up to the individual team. However, a tiered approach is often the most effective way to announce potential problems at an appropriate level of urgency.
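The tiers described above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended implementation: the function name `alert_tier`, the tier labels, and the sample history are all hypothetical, and the handling of a history with no variation is an assumption.

```python
from statistics import mean, stdev

def alert_tier(value, history):
    """Return a hypothetical alert tier for `value` based on past observations."""
    centre = mean(history)
    spread = stdev(history)  # sample standard deviation
    if spread == 0:
        # Assumption: if the metric has never varied, any change is top tier.
        return "none" if value == centre else "page"
    deviations = abs(value - centre) / spread
    if deviations > 3:
        return "page"       # automated call or SMS to on-call team members
    if deviations > 2:
        return "email"      # email to team management
    if deviations > 1:
        return "dashboard"  # notification on a dashboard
    return "none"

# Illustrative history with an average of 100 and a standard deviation of 2.
history = [100, 102, 98, 101, 99, 100, 103, 97]

for reading in (101, 103, 105, 107):
    print(reading, alert_tier(reading, history))
```

With this sample history, readings further from 100 move up through the tiers, so each escalation reaches a wider or more urgent audience.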

Problems with Data

While metrics and alerts work with many development pipelines, there are occasionally problems with data. Numeric figures alone may not always tell the full story. Sometimes a potential problem may not translate well to numerical values. Even if the team can distil a definite metric from part of the pipeline, it may not be a figure that the team members readily understand. The fact that teams can pull data out of a process does not mean the data is valuable.

Problems with Non-Gaussian Distributions

Data that has a “non-Gaussian distribution” is simply information that does not follow a normal distribution, the familiar bell curve, or some other standard format. For these cases, a meaningful measure of centre may be impossible to find, or ranges may be too erratic to reliably indicate problems. Because of this, standard deviations and set ranges may not be valuable for effective alerts. In some extreme cases, going a few standard deviations below the average gives a negative value, which would never happen in a practical application. If teams attempt to apply alerts to these values, the result can be over-alerting or under-alerting, which prevents teams from seeing when problems occur.
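A small Python sketch shows how this can happen. The numbers are invented for illustration (imagine response times in milliseconds with one extreme outlier), and the variable names are hypothetical.

```python
from statistics import mean, median, stdev

# Invented, heavily skewed sample: most values are small, one is extreme.
response_times_ms = [10, 12, 11, 13, 10, 12, 11, 500]

average = mean(response_times_ms)
typical = median(response_times_ms)
spread = stdev(response_times_ms)

# Three standard deviations below the average is negative,
# which a response time can never be.
lower_bound = average - 3 * spread

# One standard deviation above the average is far above a typical value,
# so a reading of 200 ms would not even reach the first alert tier.
first_tier_upper = average + spread

print(f"average={average:.1f} median={typical:.1f} stdev={spread:.1f}")
print(f"lower bound at 3 deviations: {lower_bound:.1f}")
print(f"200 ms within one deviation: {200 < first_tier_upper}")
```

Here the average sits far from the typical value, and the standard deviation is so large that the alert bands are either impossible (negative) or too wide to catch a reading many times higher than normal.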

Solutions

Though non-Gaussian distributions do complicate metrics and alerts, there are ways to create meaningful alerts from any data value. When values change in consistent patterns, it may be appropriate to use some predictive analysis. If a value falls outside the range predicted for that point in the pattern, there may be a problem, even if the same value does occur at other points along the pattern. By adjusting the high and low ends of the range according to anticipated values, the team can still create triggers for anomalous values. These solutions are certainly more complicated than clear-cut ranges or standard deviation tiers, but an automated system will equip the team to see problems with even the most complex data.
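One very simple form of this idea is to predict a range for each point in a repeating pattern and compare each reading against the range for its own point. The sketch below is only an illustration under several assumptions: the prediction is a plain average of past values for the same time slot, the tolerance is a fixed 20 percent, and the names `predicted_range`, `is_anomaly`, and the sample data are hypothetical.

```python
def predicted_range(history_by_slot, slot, tolerance=0.2):
    """Predict a (low, high) range for a slot from the average of its past values."""
    past = history_by_slot[slot]
    expected = sum(past) / len(past)
    return expected * (1 - tolerance), expected * (1 + tolerance)

def is_anomaly(value, history_by_slot, slot, tolerance=0.2):
    """Return True when `value` falls outside the range predicted for `slot`."""
    low, high = predicted_range(history_by_slot, slot, tolerance)
    return not (low <= value <= high)

# Invented request counts that follow a day/night pattern.
history_by_slot = {
    "night": [100, 110, 90],
    "day": [1000, 1050, 950],
}

print(is_anomaly(1000, history_by_slot, "day"))    # normal for the day
print(is_anomaly(1000, history_by_slot, "night"))  # same value, unusual at night
```

The same reading is acceptable during the day but flagged at night, because the high and low ends of the range move with the anticipated value. A real system would likely use a more capable forecasting method than a per-slot average.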
