There is a silent vandal lurking in our DevOps. When we look - he obscures. When we listen - he muffles. When we seek - he confounds.
His name is a powerful thing, an oxymoron that holds within it everything you need to know - Alert Fatigue.
And we’re not talking about it nearly enough.
We are all dimly aware of the toll taken on our minds by our technology-driven lifestyles. We live in a continual din of alerts, sharpening our senses to input we perceive as relevant - signal - and dulling them to what lies just underneath that threshold - noise.
Alert Fatigue in the field of DevOps and Production Monitoring has been a continual uphill battle for companies striving to enhance their business process management and process optimization. We’ve allowed ourselves to be dazzled by the increasing ease with which data can be extracted and displayed, to fall victim to the old adage: Bigger is Better. Instead of shaping these technologies around our process, we’ve allowed them to dictate it. And we all know, firsthand, exactly how it happened.
Beep Ring Click Tone
Bell Chime Buzz
An alarm goes off in a far corner of your street. You lie in bed, blissfully undisturbed.
Your car has just been stolen.
From the moment we are born, as we grow up and mature, our brains develop the ability to tune out noise at the expense of signal integrity.
This is not a mechanism. This is the mechanism.
Our brain is a pattern-matching machine, designed to process, catalog and prioritize information at breakneck speed. This is a great gift and something of a uniquely human superpower.
The comfort of this knowledge cannot, sadly, un-steal your car.
An overabundance of information in a pattern-matching system is at risk of being classified and reclassified into what it perceives as repeating patterns. As these patterns are qualified again and again, they ossify in the mind, which starts seeking out and “identifying” these existing patterns instead of watching out for new ones. Pretty soon, you are no longer sensitive to the signal at all. This is also called habituation, and is, in fact, another form of learning.
A savvy human grasps this intuitively, and does what they can to increase their SNR (signal-to-noise ratio), to stay as alert and attentive to what’s actually going on as possible. However, without relevant training and continuous reminders (which we are sorely lacking in our corporate environments), it is near impossible for a person to vanquish this problem on their own.
Sifting the Chaff from the Wheat
Alert Fatigue is caused by the growing cascade of non-actionable items, false positives and false negatives in our monitoring stack. This is our chaff.
Actionable items are the data we’re trying to correctly identify and act upon.
These include possible harmful behavior, actually harmful behavior, desirable behavior, insights and suggestions for improvement. This is our wheat.
Just Wheat for Me, Please!
If only it were that easy. A perfect alert system strikes a balance between two factors: sensitivity and specificity.
Perfect Sensitivity means that all relevant events are correctly identified as such.
Perfect Specificity means that all irrelevant events are correctly identified as such.
These also outline the two greatest risks:
Bad Sensitivity will have us miss relevant tasks and insightful information.
Bad Specificity will waste resources on irrelevant tasks and misleading information.
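To make the trade-off concrete, here is a minimal Python sketch of the two metrics as they are commonly computed from a confusion matrix (the counts below are purely hypothetical, chosen to depict a noisy alerting system):

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of relevant events correctly flagged (also called recall)."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of irrelevant events correctly ignored."""
    return true_negatives / (true_negatives + false_positives)

# A noisy alerting system: it catches nearly everything, but cries wolf often.
print(sensitivity(true_positives=98, false_negatives=2))     # 0.98 - high
print(specificity(true_negatives=400, false_positives=600))  # 0.4  - low
```

A system tuned like this one misses almost nothing, but six out of every ten irrelevant events still page someone - exactly the recipe for Alert Fatigue.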
As you can imagine, striking the perfect balance between Sensitivity and Specificity has proven one of the hardest problems faced by monitoring and alerting services.
From the medical field, to air travel, to physical safety and cyber attacks, the statistics on this are staggering.
In the medical profession, the ECRI Institute has included Alert Fatigue in its list of Top 10 Health Technology Hazards for 2015. 19 out of 20 hospitals surveyed in 2014 ranked Alert Fatigue as a top patient safety concern. Alert Fatigue has become a major safety concern and a regulatory priority. In short, Alert Fatigue is costing lives.
In the financial and cyber-security sectors, we can look to the case of Target’s disastrous 2013 data breach, when 40 million card records were stolen along with 70 million records containing identity information such as addresses and telephone numbers. Despite numerous alerts, the staff at Target did not react to this threat in time, because similar alerts were commonplace and the security team incorrectly classified them as false positives. Cyber security has now become one of the fastest growing industries in the world, and one that is entirely dependent on timely alerts and on reducing non-actionable items and false positives.
The problem with our monitoring is further exacerbated by a few key factors:
1) Technological Availability - Automation and monitoring have become increasingly cheap and easy, pushing us to develop more and more work methods that strive to put as much information at the analyst's fingertips as possible.
“So you’re charging 1 more dollar a month to double
the amount of information displayed? Take 2!”
2) False Sense of Security - The abundance of information lulls us into complacency. We have so many monitors, alerts, graphs, dashboards and displays that we come to believe we’re surely covered and safe from all possible scenarios. We come to trust our technology to the point of believing that any issue that isn’t screaming at us from every monitor couldn’t be important, and that anything that hasn’t happened yet likely never will.
“Welp, we just added 2 more monitoring tools and quadrupled
the number of alerts by turning up our filters’ Sensitivity.
I’ll never worry about missing anything bad happening ever again.”
3) Ninja Brains - Our brains’ mad pattern-matching skills also make us falsely detect information that isn’t really there, confirming our previously held biases.
“Oh, I recognize this alert from 10 seconds ago. And every 10 seconds before that.
I know how to handle it, no need to waste time analyzing what is clearly the exact same thing.”
Clearly, nobody is out to fatigue anyone. Employers and employees alike are becoming more aware of the new and varied dangers posed by information overload.
What Companies Are Doing To Address This
There are several lines of solutions to address this serious issue. We will be dividing those into 3 tiers.
Tier 1 Solutions are those that require continuous or near continuous manual maintenance.
Tier 2 Solutions make our monitoring automatically adaptive. These often require some degree of manual maintenance as well, but one that has a much higher ROI.
Tier 3 Solutions are - a secret. Keep reading.
All 3 Tiers have one thing in common - they require that corporations realize there is a serious problem in the field of Production Monitoring today, and that it can’t be solved by relying purely on our human brains. Committing to a solution that makes technology work for us, and not the other way around, is key.
Tier 1 Solutions - Concise Alerting
Drowning in noise, we forget there’s a signal. Optimizing signal-to-noise ratio is what Tier 1 solutions are all about. These include:
- Reducing false positives, false negatives and non-actionable items.
- Consolidating and aggregating alerts.
- Differentiating between important and urgent.
- Knowing who are the right people to notify about specific issues.
- Timing alerts to appropriate times in the work cycle.
- Releasing alerts at a pace relevant to the current ability to handle them.
- Enabling the user to create filters and rules for alerts.
- Providing more context, names and descriptions to make alerts less obtuse.
- Tailoring similar yet disparate objects with unique, individual filters.
- Honorable mention: Rotating the employees monitoring alerts to avoid fatigue.
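As a toy illustration of consolidation and aggregation, one could collapse repeats of the same alert within a time window into a single entry with a count. (This is a minimal sketch; the alert fields, window size and dictionary shape here are hypothetical, not any particular tool's API.)

```python
from collections import defaultdict

def aggregate_alerts(alerts, window_seconds=300):
    """Collapse repeated alerts of the same type within a time window
    into one entry with a count, reducing noise for the on-call engineer."""
    groups = defaultdict(list)
    for alert in alerts:
        # Bucket by alert type and the time window the alert falls into.
        bucket = (alert["type"], alert["timestamp"] // window_seconds)
        groups[bucket].append(alert)
    return [
        {"type": t, "count": len(batch), "first_seen": batch[0]["timestamp"]}
        for (t, _), batch in sorted(groups.items())
    ]

alerts = [
    {"type": "disk_full", "timestamp": 10},
    {"type": "disk_full", "timestamp": 40},
    {"type": "cpu_high", "timestamp": 50},
    {"type": "disk_full", "timestamp": 700},  # falls into a later window
]
print(aggregate_alerts(alerts))
```

Instead of three separate "disk_full" pages, the on-call engineer sees one entry with a count of 2 for the first window and one for the later window - the same information, a fraction of the noise.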
A lot of these solutions very clearly go hand-in-hand.
For example, providing more context is, in itself, a worthwhile goal. Many alerts suffer from a lack of proper context, which makes them appear obtuse and therefore get ignored. However, it’s quite clear that without reducing the number of false positives, providing more alert context will only exacerbate the problem. You don’t fight Information Overload with more information.
The problem with Tier 1 solutions - they require a level of manual maintenance that is, in essence, never-ending. Here you risk replacing one problem with another: instead of dealing with the Alert Fatigue resulting from crude, “democratized” filters, you now need to deal with the constant manual maintenance of every node in your network.
Though you are still definitely improving the overall condition of your monitoring and alerts, these solutions in themselves will always be limited by our human capabilities. This is why, on top of Tier 1-type solutions, we must apply Tier 2-type solutions.
Tier 2 Solutions - Smart Alerting
Tier 2 solutions implement self-improving, adaptive behavior. What was previously applied and maintained manually through filters and rules, is now learning and maturing on its own, with varied levels of human involvement. These solutions weave continuous improvement into the system’s DNA.
For this, the system needs to utilize Machine Learning technologies. Supervised and unsupervised learning use statistical models to match patterns, categorize data points, and build decision trees from training and test data, using recursion and continual data collection to optimize results and reduce overfitting. This creates a hierarchy capable of observing sequences of patterns, enabling it to generate insights and predict outcomes. Adaptive thresholds are established, adjusting to what is and isn’t expected at a given moment, and acting accordingly.
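One of the simplest forms an adaptive threshold can take - a deliberate simplification of what production systems actually do, with made-up latency numbers - is a band around the recent mean of a metric, so the threshold drifts along with the metric's normal behavior instead of being set once by hand:

```python
from statistics import mean, stdev

def adaptive_threshold(history, k=3.0):
    """Return (low, high) bounds: values outside mean +/- k*stdev of
    recent history are treated as anomalous. The band moves as the
    metric's normal behavior drifts, unlike a fixed manual threshold."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, history, k=3.0):
    low, high = adaptive_threshold(history, k)
    return not (low <= value <= high)

# Latency hovering around 100ms: 250ms is an anomaly, 105ms is not.
recent_latencies = [98, 101, 100, 99, 102, 100, 97, 103]
print(is_anomalous(250, recent_latencies))  # True
print(is_anomalous(105, recent_latencies))  # False
```

If the service later settles at 150ms, the same code, fed a fresh window of history, stops alerting on 150ms - no human has to retune anything.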
The problem with Tier 2 solutions - they are still ultimately dependent on users reading logs and other machine-driven data and its by-products (graphs, dashboards, statistics).
When solving a problem as complex, both technological and psychological, as Alert Fatigue, we need to take a stand and distinguish between the solutions that keep feeding the problem, and those that render the question moot.
Tier 3 Solutions - Beyond Alerting
In order to truly move beyond this question, a paradigm shift is required. We need to rethink the axioms we’re operating under and snap out of our Alert-laden haze.
A lot of the solutions outlined above overlook a major impediment to finding a longstanding solution to this problem. This impediment rarely receives notice since it undermines the axiom at the base of DevOps and Production Monitoring. Tier 3 solutions go beyond alerting - they enable almost anyone to communicate and relate to the information presented, in a format that’s more akin to a conversation.
Currently, Production Monitoring requires some level of training to understand and respond to properly. This creates a long series of technical bottlenecks: points in the process where nothing can move forward, even at the initial level, without a trained professional. Beyond even the most cursory level, this individual needs to be very highly skilled, and to sink a lot of those precious skills into analysis.
Crafting a Production Monitoring solution that makes monitoring and alerting legible even to a person with minimal training could have a drastic and disruptive effect on this field. By allowing more people to take part in the monitoring and alerting process, you are in fact freeing up and enhancing the invaluable resources of your technical professionals. Any resources previously allocated to the analysis of raw or nearly-raw data can now be directed towards creative work, problem solving and the continual growth of your business.
Consider also that time is not the only resource saved in this scenario.
Will is a finite resource. So is attention. Having information relayed to employees in a manner that relieves the strain on their intellect enhances their ability to focus on what humans still do better than machines - creative work.
Finally, as Artificial Intelligence makes leap after leap, the logical next step after alerting is resolving. As complex systems cast their analytical nets wider and wider, they become adept enough to successfully self-correct. Eventually, in the not-so-distant future, technology will release us from these shackles entirely. Then we will take a breath, gather our strength, and start working on the next big leap in humanity’s future.
About Loom Systems
Loom Systems Ops is a next-generation AI Operations Analytics solution that acts as your team’s artificially intelligent team member; endlessly patient, it monitors your entire stack 24/7, predicting possible issues and reporting existing ones in clear, plain English.
Loom Systems Ops saves you countless hours on monitoring and analysis by weeding out the irrelevant noise, while predicting and notifying you of everything that is of real value to your business success. Schedule Your Live Demo