The 3 major problems of production monitoring systems are:
- Exhausting manual configuration.
- Outdated toolset.
- Costly services.
Here at AI Joe we don't leave things to chance, so we've further broken these down into their base components, for great justice.
So whether you're a new company just starting to build its IT stack from scratch, or an established firm looking to improve its existing stack, use this list to try and avoid stumbling into the most common pitfalls.
Buckle up, it's gonna be a bumpy ride.
Manual Configuration Pain-Points
1. Knowing what questions to ask is HARD.
As simple as this may seem, knowing what to ask your monitoring system to monitor is a vastly complex question, dependent on a nearly infinite variety of needs, technological limitations, contractual obligations and business compliances. Knowing what to ask your monitoring stack to reveal is a constantly morphing puzzle, requiring much skill, patience and time.
2. The soul-crushing necessity for pre-processing.
Your IT stack is likely to be a mishmash of language runtimes and operating systems. However, underlying this, a lot of its raw data may actually be quite similar. Yet precious time is spent translating it from format to format to make sure your monitoring system can ingest all the different inputs.
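To make the point concrete, here is a minimal sketch of the kind of glue code this translation work produces. The two input formats, the field names and the parser functions are all hypothetical, made up for illustration; the only claim is that two differently shaped records can carry the same underlying data.

```python
import json


def parse_json_line(line):
    """Parse a JSON record (hypothetical format, e.g. from an application)."""
    raw = json.loads(line)
    return {"host": raw["hostname"], "metric": raw["name"], "value": float(raw["val"])}


def parse_kv_line(line):
    """Parse a space-separated key=value record (hypothetical format, e.g. from a script)."""
    raw = dict(pair.split("=", 1) for pair in line.split())
    return {"host": raw["host"], "metric": raw["metric"], "value": float(raw["value"])}


json_line = '{"hostname": "web-1", "name": "cpu.load", "val": 0.75}'
kv_line = "host=web-1 metric=cpu.load value=0.75"

# Two different wire formats, one identical record once translated.
same_record = parse_json_line(json_line) == parse_kv_line(kv_line)
print(same_record)
```

Every new input format means one more parser like these to write, test and maintain.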
3. Deciding who’s in charge.
Selecting the right tools, metrics, logs, apps, hosts, infrastructure and networks can keep even the most talented monitoring teams scratching their heads. Building such a team and stack is no easy feat, as it requires a wide range of technical expertise and, more importantly, the time to properly set up and maintain.
4. Everyone thinks and works differently.
Any system that depends on manual input will always be subject to the writer's whims. As such, differences often arise in the spacing, naming, and spelling of metrics and other heuristics. These inconsistencies are a kiss of death to any well-maintained system. Investing resources in setting up and policing a uniform workflow is great! It's also one more thing you'll have to keep doing - forever.
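As an illustration of what "policing a uniform workflow" can look like in practice, the sketch below folds a few spelling variants of a metric name into one canonical form. The function name and the chosen convention (lowercase with underscores) are assumptions for the example, not a standard. Note that it only handles spacing, separators and casing; genuine spelling mistakes would still slip through.

```python
import re


def normalize_metric_name(name):
    """Fold spacing, separator and casing variants into lowercase snake_case."""
    # Insert an underscore at camelCase boundaries: "cpuLoad" -> "cpu_Load".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()


variants = ["CPU Load", "cpu-load", "cpuLoad", " cpu.load "]
canonical = {normalize_metric_name(v) for v in variants}
print(canonical)
```

All four variants collapse to a single name, which is exactly the kind of rule somebody has to write down, enforce and keep extending.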
5. Hosted monitoring is a drag.
Unless the tool you’re using is client-based, your monitoring will need to go through firewalls, require the configuration of VPNs and/or ACLs, and demand numerous changes to the hosts’ configuration. All of this needs constant supervision from your security team to make sure the right things are getting in and out, which leads us to…
6. More teams = more bottlenecks.
Once you do want to integrate a new tool or update an existing one, the system misses a lot of beats. All tools are dependent on the graces of developers to get new metrics put in place and to keep the set of metrics consistent. When needs change, as often happens in high-growth services, time and attention are diverted from higher-value tasks. This makes changes hard to carry out, causing you to reconsider making them in the first place (even when they’re very necessary). The longer the road, the more can go wrong.
Outdated Toolset Pain-Points
8. Tools are too rigid, can't learn.
Many tools rely on boolean checks and thresholds: these are static configurations, and highly inflexible. Systems scale and evolve, so this mode of operation must be replaced by pattern recognition technologies.
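The difference can be sketched in a few lines. Below, a fixed threshold is compared with a check that judges each new value against the metric's own recent history. The z-score used here is only one very simple stand-in for "pattern recognition", chosen for brevity; the function names, the threshold of 90 and the limit of 3 standard deviations are all assumptions for the example.

```python
from statistics import mean, stdev


def static_alert(value, threshold=90.0):
    """Classic static check: fire only when a fixed threshold is crossed."""
    return value > threshold


def adaptive_alert(history, value, z_limit=3.0):
    """Fire when a value deviates strongly from the metric's recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate normal behaviour
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change is unusual
    return abs(value - mu) / sigma > z_limit


cpu_history = [20, 22, 21, 19, 20, 21]  # a host that normally idles around 20%
spike = 60

print(static_alert(spike))                 # False: 60 is below the fixed threshold
print(adaptive_alert(cpu_history, spike))  # True: 60 is far outside this host's norm
```

The static check stays silent because nobody thought to configure a threshold for this particular host, while the adaptive one needs no per-host configuration at all.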
9. Poor granularity, low resolution, low performance pollers.
Many tools can only focus on system-down or binary yes/no situations. Tools need to bring more grayscale into the mix without compromising proper exposure of information to the user. Meaning, they should be able to report black, report white, and report as many shades of gray as lie in between.
10. Compromising sensitivity and specificity.
Once you’ve had to define what it is you want your monitoring system to do (in itself a problematic necessity, as we’ll discuss), it needs to be able to correctly identify what does and does not fall under these categories. Failure to do so with extreme precision results in wasted time and effort. The balance that needs to be struck between sensitivity and specificity (the ability to identify what needs to be reported, and what needn’t be reported, respectively) is at the heart of all the configuration and tweaking performed by your monitoring team, and probably THE biggest question facing developers of monitoring tools and stacks.
The fact that all this depends on your team’s tinkering is, honestly, asking for trouble.
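For reference, both measures have standard definitions in terms of how alerts line up with real incidents: sensitivity is the share of real problems that triggered an alert, and specificity is the share of healthy situations that correctly stayed quiet. The sketch below computes them; the counts in the example are invented for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """Share of real problems that were reported: TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no real problems in the sample")
    return true_positives / total


def specificity(true_negatives, false_positives):
    """Share of non-problems that were correctly left unreported: TN / (TN + FP)."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no healthy cases in the sample")
    return true_negatives / total


# Invented counts: 50 real incidents, of which 45 alerted and 5 were missed;
# 950 healthy checks, of which 50 raised a false alarm.
print(round(sensitivity(45, 5), 3))     # 0.9
print(round(specificity(900, 50), 3))   # 0.947
```

Tightening a threshold typically raises one number at the expense of the other, which is why so much of the tweaking ends up here.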
11. Too little integration.
Many tools have missing or poor integrations (APIs) and are not easily extensible. These tools were developed with a lot of offhand assumptions about the systems they’ll be monitoring (sometimes using proprietary or esoteric components), which forces you into one of two unpleasant scenarios:
a) Either add to and adjust your existing system until it fits the criteria of the new tools you want to implement; or
b) Choose tools that aren't necessarily the best for your needs, simply because they stack well with your existing system.
* For a schizophrenic overview of this issue, refer to post #4 in this series, to the subsection titled "Too much Integration".
12. They’re not timely.
A lot of tools are not quick enough with reporting and analysis. Latency issues are incredibly costly and painful to any organization.
13. They don't give a holistic solution.
All tools either just monitor for bad things, or let us know when beneficial trends stop, or predict (often poorly) potentially harmful things, or give us suggestions for improvement, or give us a mix of these, but no tool delivers the full package, forcing you to use a multitude of tools.
14. They can't handle the truth!-... That is, logs.
One of our most precious resources of information is going to waste: logs.
Logs contain vast amounts of the most relevant information to be monitored; they are often the first thing analyzed by a human once a problem is detected. Most currently available solutions can handle metrics very well, but they all have a serious problem handling logs. The reason behind this is that logs are always designed with the intent of being human-readable at one point or another, which makes them hard to parse by anyone other than a human.
Hard. Not impossible.
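A minimal sketch of the "not impossible" part: turning one human-readable log line into structured fields with a regular expression. The log layout (timestamp, level, service, message) and the sample line are hypothetical; real logs vary far more than this, which is precisely what makes the problem hard.

```python
import re

# Hypothetical layout: "<date> <time> <LEVEL> <service>: <free-text message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>[\w-]+): "
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return the line's fields as a dict, or None if it does not fit the layout."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


line = "2024-01-15 10:32:07 ERROR payment-api: connection to db timed out after 30s"
record = parse_log_line(line)
print(record["level"], record["service"])

# Lines written in any other style simply fall through.
print(parse_log_line("something went wrong, see above"))
```

One pattern covers one layout. Every service that phrases its logs differently needs its own, unless the tool can learn the structure by itself.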
15. They offer poor storage and data collection.
Many tools handle metric collection very poorly, throwing out the data after providing alerting. Retention is poor due to legacy configurations that don’t take into account the explosion in cloud storage and big data analysis that happened in recent years.
16. Persistency and resilience are lacking.
Most tools do not provide enough persistency during network issues or host-down issues, and information is only available as long as there is connectivity.
Our systems are not resilient enough; they’re often centralized, so outages leave crucial metrics trapped inside failed agents and pathways. We must move away from this into a solution that is impeccably stable.
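One common way to keep metrics from getting trapped is store-and-forward: the agent holds samples locally while the link is down and delivers them, in order, once it comes back. The sketch below is a simplified illustration under stated assumptions: the BufferedSender class and the flaky_send function are hypothetical names, and the buffer lives in memory only, so it would not survive a crash of the agent itself.

```python
from collections import deque


class BufferedSender:
    """Hold samples locally and forward them whenever the link allows."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # callable that raises ConnectionError on failure
        self._buffer = deque(maxlen=max_buffered)  # oldest samples drop when full

    def submit(self, sample):
        self._buffer.append(sample)
        self.flush()

    def flush(self):
        """Try to deliver buffered samples in order; stop at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return False  # keep the sample for the next attempt
            self._buffer.popleft()
        return True

    def pending(self):
        return len(self._buffer)


# Simulated collector that is unreachable at first.
link = {"up": False}
delivered = []


def flaky_send(sample):
    if not link["up"]:
        raise ConnectionError("collector unreachable")
    delivered.append(sample)


sender = BufferedSender(flaky_send)
sender.submit({"metric": "cpu.load", "value": 0.7})
sender.submit({"metric": "cpu.load", "value": 0.9})
held_during_outage = sender.pending()  # both samples are held, none lost

link["up"] = True
sender.flush()
print(held_during_outage, len(delivered), sender.pending())
```

A production-grade agent would also persist the buffer to disk and avoid a single central collector, but the principle is the same: an outage should delay data, not destroy it.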
17. There’s an over-reliance on end-user reporting.
Problems that are reported by the users affected by them are harder to pinpoint - are they network issues? Server? Database? Virtual Machine? What does the reporter mean when they explain their experience?
18. Mo’ tools, mo’ problems.
As we said, most companies need to implement several monitoring tools to ensure full coverage across different parts of their stack: alerting, dashboards, developer metrics, service metrics and logs. Consequently, our monitoring systems are a tangled mess. Even when they're not, more tools equals more training, more decisions that need to be made regarding each of these tools, and more links that can break in the monitoring chain. A long chain can, and will, break in more places.
19. Micro-services are a wolf in sheep's clothing.
Moving your monitoring solution to a micro-services-based stack may look like it's the end-all answer to maximum flexibility, but keep in mind that it can cause an exponential explosion of monitored logs, a lot of which need to be pre-processed. Again we see a so-called "solution" which ends up creating more work.
20. Overabundance of options.
There has been an explosion in recent years of monitoring tools that enable you to monitor every part of your IT stack. However, this overabundance can also cause confusion and fatigue, which ultimately may lead to inaction, or inelegant action. This issue will be covered from several directions later in this series.
21. UI is wonky.
Existing tools are by no means intuitive or easy to use. There are too many clicks to get anything done, and the UI is outdated (for example, some tools offer no pagination). This makes training new users a very time-consuming operation.
22. They give unactionable information.
Many tools present - by misguided system design or problematic configurations - information that is ultimately unactionable. This creates a psychological strain on the users and costs them precious time.
23. They give non-contextual information.
Many tools don’t understand state changes. Alerts gathered by these tools are often technically complex (and therefore time-consuming to comprehend) and unprecedented on the system. Without context gathered from a continuously adaptable system, this information is useless. Information needs to be presented correlatively - instead of leaving that up to humans.
In addition, most tools treat all data equally. But not all data is created equal: tools need to be able to handle different data in different ways without exposing the user to a confusing overabundance of options.
24. Developers aren’t exposed to production.
Most tools offer developers very little insight into the production environment running their code. This is one of the major pain-points of the DevOps field.
Costly Services Pain-Points
25. (very) Slow time-to-value.
Other than being costly in and of themselves, many tools provide slow time-to-value because of immense configuration needs: covering many teams, tools that need integrating, working procedures that need changing, etc. Thus, tools you meant to install to save you time end up sucking away days and weeks in configuration before they even get off the ground.
26. Often demand additional costly services.
Many tools promise to deliver certain things, only to end up costing extra in support, professional services, and crucial additional features. These hidden costs (some of which are a direct result of tools’ inherent complexity) can be a deathblow to a monitoring stack, as they end up diverting precious resources.
27. Their high price drives you to misuse them.
The high price of services leads many to use their tools in ways they were not meant to be used, or to avoid buying new monitoring tools to stay within budget.