In the past, the production monitoring field was dominated by a small group of corporate giants - enterprise suites of vast breadth, each designed to be the monitoring solution to end all others.
Instead, their expansiveness was their undoing; in an age of scalable, virtualized, cloud-based services, they were simply too cumbersome and rigid to keep up with the needs of their own customers.
Recent years have seen an explosion in the number and variety of tools designed to address companies’ production monitoring needs. The overall number of these tools has more than doubled in the past 5 years alone.
These boutique, do-one-thing-but-do-it-well solutions were designed to be part of an ever-extensible, ever-modular new form of monitoring system. Now every monitoring system can be scaled, squeezed and reinvented to fit whatever needs its creators deem necessary.
To break down the variability of these tools, let’s note the distinct services they might provide:
- System monitoring
- Log management and analysis
- Time-series databases
- Anomaly detection
- Event processing
- Application performance management (APM)
- Web access monitoring
- Error tracking
And that's just to name a few types of solutions, with more tools being developed to serve ever more specific functions.
As soon becomes apparent, a company seeking an extensible, flexible monitoring solution needs to research, select, configure, maintain and monitor several of these tools at the same time in order to get adequate coverage of its IT stack. Some tools provide insights into analysis and resolution, others focus on collection, aggregation and presentation of data, others on analytics, and so on.

Ultimately, whatever machinery of tools you choose to implement can be roughly divided into 2 broad categories: Alerting tools and Visualization tools.
It’s worth taking the time to make a distinction between the problems faced by these 2 types of tools. Both of these have their own unique advantages and disadvantages that should be noted when building a monitoring stack.
Alerting Tools - The Bad Cop
1. Nobody likes them
Working with a tool you inherently dislike has a slew of negative effects on your work, and - let's face it - alerts are not very likeable.

They're intrusive, they're aggressive, they're exhausting, often inscrutable and rarely bear good tidings. And that's just for starters.
2. We’re doing it wrong
It's weird to think of such an elemental part of many of our monitoring environments as causing so much damage. Don't get us wrong, alerting in itself holds many, many advantages as a monitoring tool. However, its execution is often sorely lacking. Bad implementation - too many alerts of an inconsequential nature going out to the wrong people at bad times - means incoming alerts are often ignored out of hand. When an alerting system is untrustworthy, as many alerting systems are, it becomes inefficient at best and, more often, plainly damaging. When almost the only alert that delivers actionable information is an angry phone call from an end-user, something is seriously wrong.
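To make "the wrong people at bad times" a bit more concrete, here is a minimal sketch of a routing rule that only pages someone for critical alerts and holds lower-priority ones back during quiet hours. Everything in it is an assumption for illustration: the severity names, the channel names, the `route` function and the quiet-hours window (which is assumed to span midnight) are hypothetical and not taken from any particular tool.

```python
from datetime import time

def route(severity, owner_team, now, quiet_start=time(22, 0), quiet_end=time(7, 0)):
    """Decide how an alert is delivered and to whom.

    Assumes the quiet window crosses midnight (quiet_start is later than quiet_end).
    Returns a (channel, team) tuple; team is None when nobody is notified.
    """
    in_quiet_hours = now >= quiet_start or now < quiet_end
    if severity == "critical":
        # Critical alerts always page the team that owns the service.
        return ("page", owner_team)
    if severity == "warning":
        # Warnings wait in a ticket queue at night, go to chat during the day.
        return ("ticket", owner_team) if in_quiet_hours else ("chat", owner_team)
    # Anything else is only recorded, never pushed at a person.
    return ("log", None)

print(route("warning", "database", time(3, 0)))
print(route("critical", "database", time(3, 0)))
```

The point is not the specific rules but that the decision of who gets interrupted, and when, is written down somewhere and can be reviewed.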
3. Require continuous maintenance
Alerting systems require extensive configuration when the system is first installed, and henceforth demand continuous attention, maintenance and care.
Thresholds and rules need to be continuously reviewed to make sure they're up to date with current system specs. Sadly, this ongoing investment of additional resources is actually the best-case scenario. More often than not, what ends up happening is you're stuck with...
4. Out-of-date & legacy configurations
At times of perceived operational efficacy, resources are rarely directed towards maintaining the system at peak precision, and monitoring teams rely on alerting systems that are becoming more and more outdated by the hour. Teams usually find out that a configuration needs updating only when the monitored specs have changed to such a degree that the alerts are way, WAY off - meaning, AFTER calamity has already struck. When the dust settles, it's back to business as usual, with the system again becoming more and more inefficient as time goes on.
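One way to catch a stale threshold before calamity strikes is to periodically compare the configured value with what recent data would suggest. The sketch below is a rough illustration under stated assumptions, not a recommended detection method: it assumes a simple "mean plus three standard deviations" rule, positive metric values, and made-up CPU readings. The function names and the 25% tolerance are hypothetical.

```python
from statistics import mean, stdev

def suggested_threshold(samples, sigmas=3.0):
    """Derive an upper alert threshold from recent samples: mean + sigmas * stdev."""
    return mean(samples) + sigmas * stdev(samples)

def is_stale(configured, samples, tolerance=0.25):
    """Return True if the configured threshold differs from the suggested one
    by more than `tolerance`, expressed as a fraction of the suggested value."""
    suggested = suggested_threshold(samples)
    return abs(configured - suggested) / suggested > tolerance

# Hypothetical CPU utilisation readings (percent) taken after a hardware upgrade.
recent_cpu = [22, 25, 24, 27, 23, 26, 25, 24]

print(is_stale(configured=90.0, samples=recent_cpu))  # threshold set for the old hardware
print(is_stale(configured=30.0, samples=recent_cpu))  # threshold close to current behavior
```

A check like this does not replace the ongoing maintenance, but it can at least flag which rules have drifted furthest from reality.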
5. They’re no good as decision support systems
The problem is not with what alerts are or aren't - the problem is with how people handle alerts, or what they think alerts ARE doing when they're actually NOT.
Alerts are not a good decision support system, but they are often treated as such.
6. They encourage reactive behavior
By its nature, the response to an alert always comes after the fact. This keeps the monitoring team in a reactive state, where they are waiting to be notified of issues after they've arisen, rather than proactively pursuing ways to improve and detect different system behaviors.
7. Uninspired variations on how to handle them
Many tools do not offer a wide enough range of actions to take when handling alerts. There are many ways to resolve an ongoing event, and systems need to give the user the right toolset to handle this range (for example, some systems let you acknowledge an issue but won't bounce it back to you if you want to return to it later). In other words, when you're holding a hammer everything looks like a nail, and when your range of options stops at "Resolved" and "Dismiss", you're missing out on a whole range of possible ways to tackle the problem.
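As an illustration of what a wider range of actions could look like, here is a small sketch of an alert that can be acknowledged, snoozed so that it bounces back later, resolved or dismissed. The `Alert` class, its state names and the `tick` method are hypothetical and only meant to show the idea, not how any existing product models alerts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

class State(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    SNOOZED = "snoozed"
    RESOLVED = "resolved"
    DISMISSED = "dismissed"

@dataclass
class Alert:
    title: str
    state: State = State.OPEN
    wake_at: Optional[datetime] = None  # when a snoozed alert should reappear

    def acknowledge(self) -> None:
        self.state = State.ACKNOWLEDGED
        self.wake_at = None

    def snooze(self, now: datetime, minutes: int) -> None:
        """Hide the alert for a while instead of resolving or dismissing it."""
        self.state = State.SNOOZED
        self.wake_at = now + timedelta(minutes=minutes)

    def tick(self, now: datetime) -> None:
        """Re-open a snoozed alert once its snooze window has passed."""
        if self.state is State.SNOOZED and self.wake_at is not None and now >= self.wake_at:
            self.state = State.OPEN
            self.wake_at = None

    def resolve(self) -> None:
        self.state = State.RESOLVED
        self.wake_at = None

    def dismiss(self) -> None:
        self.state = State.DISMISSED
        self.wake_at = None

# Hypothetical usage: snooze for 30 minutes, then let the alert bounce back.
start = datetime(2024, 1, 1, 9, 0)
alert = Alert("Disk usage above 85% on db-1")
alert.snooze(start, minutes=30)
alert.tick(start + timedelta(minutes=10))
print(alert.state)  # still snoozed
alert.tick(start + timedelta(minutes=30))
print(alert.state)  # back to open
```

Even this small model gives the person handling the alert a way to say "not now, but remind me", which "Resolved" and "Dismiss" alone cannot express.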
Visualization Tools - The Good Cop

Hey. We love visualizations. We love graphs. We love pies. The human brain reacts very favorably when presented with visually orchestrated information. Seeing information in tables, charts and graphs can spark in us strokes of structural, analytic insight that would otherwise never happen.
In the world of Production Monitoring, if alerts are the bad cop, visualizations are the good cop.
Visualization tools FTW, right?
Sadly, no.
Having a system that can display all the information in your production environment sounds great on paper. However, this tool opens up a slew of new problems that need tackling.
Visualization tools, in general, have 2 major problems: execution and burnout.
Execution
Visualization tools offer different variations of dashboards, where a user can monitor, ask questions and have almost any information available at their fingertips. On the surface, this would solve many of the problems we pointed out regarding alert-based tools. Indeed, visual tools offer many comparative advantages - when they are handled by curious, energetic and system-savvy individuals. However, it's a simple fact that not all monitoring teams are, or could ever be, composed solely of super-analysts, and even super-analysts can't be super-analysts all the time.
What you end up with then is a system with virtually limitless options for fun tinkering - options which create an illusion that such tinkering is a) actually taking place, and b) a good idea.
Because these systems depend on reporting issues, they need to show off - if a system only alerted you once a day, you'd feel cheated. And if a visual aid didn't show all the information, you'd think it was shortchanging you.
So a lot of the information that would otherwise not get retrieved, analysed and presented - is. And now you're left to stare at it for the better part of your day, trying to make heads or tails of it all. As such, it is not really useful, as it creates more work under the guise of helping you sift through the information.
Burnout
We will cover the topic of alert fatigue in a future post, but in short: it is a form of Attention Fatigue, a state characterized by a growing indifference and blindness to newly presented information due to information overload.
In the field of visualization tools we're faced with 2 other ways in which our cognition is constrained - the development of Decision Fatigue, and Analysis Paralysis.
Visualization tools promise to give us full access to our monitored systems. This intuitively seems like a good idea: the more you know about your production, the better... right?
Well, not quite.
From Wikipedia:
"In decision making and psychology, decision fatigue refers to the deteriorating quality of decisions made by an individual, after a long session of decision making."
"Analysis paralysis is an anti-pattern, the state of over-analyzing (or over-thinking) a situation so that a decision or action is never taken, in effect paralyzing the outcome. A person might be seeking the optimal or "perfect" solution upfront, and fear making any decision which could lead to erroneous results, when on the way to a better solution."
Do you see what we're getting at here?
The overabundance of options offered by visualization tools, the alluring possibility of digging deeper and deeper into the information in the hopes of gleaning new insight, could in fact be a dangerous and misleading monitoring tactic. These tools are too focused on giving us ways to tinker with information in increasingly convoluted ways, instead of solving the problem that got us tinkering in the first place.
What's more, these tools need to be created and maintained by people who know what the right questions to ask are (no easy feat, as we discussed at the beginning of this series), and how to read the information in correct and relevant ways (meaning, the engineers in question need to have a high level of technical skill).

Considering all the above, we realize that most monitoring tools fall into similar pitfalls, of either jerking their users around or shoving "value" down their throats to the point of asphyxiation.
What users need is something else entirely: what we need is an Alfred to our Batman. An invisible problem solver that doesn't add to the general noise and helps get things done without getting in the way.
When considering an alerting tool or a visualization tool, it's important to keep in mind that both types create new needs of their own while you are trying to meet an entirely different need altogether - providing your customers with a stable and useful service.
The takeaway: keep your eyes on the prize. Don't let your monitoring tools become another thing you have to worry about.
Loom Systems delivers an AIOps-powered log analytics solution, Sophie, to predict and prevent problems in the digital business. Loom collects logs and metrics from the entire IT stack, continually monitors them, and gives a heads-up when something is likely to deviate from the norm. When it does, Loom sends out an alert and recommended resolution so DevOps and IT managers can proactively attend to the issue before anything goes down.