As expected, the Microsoft Ignite event in Toronto had an impressive turnout. Thousands of Microsoft users showed up at what turned out to be a great event for those looking to do a deep dive on all things technical in the Microsoft universe.
The content was, as could be expected, quite good from a training perspective. And the conversations we had at the booth largely centered around topics related to one of the sessions I am going to focus on in this blog article today.
OPS30: Learning from Failure
For those of you interested, this session is available on the Microsoft Ignite website, where you can watch the entire presentation.
How Complex Systems Fail
The session kicked off with a relatively sobering reminder of what we face in IT: complex environments that are primed for failure. Just the way you want to start your new year, right? A nice reminder that we in IT play a high-stakes game on a daily, if not minute-by-minute, basis.
So, the presenter highlighted key elements included in a paper written by Dr. Richard I. Cook entitled (appropriately), "How Complex Systems Fail."
Three key aspects from that paper were drilled home:
- “Complex systems contain changing mixtures of failures latent within them.”
- “Complex systems run in degraded mode.”
- “Catastrophe is always just around the corner.”
What's an IT team to do?
Those three points left the speaker with two clear options: either prevent a catastrophe or respond to one. The remainder of the presentation dealt with responding to a catastrophe, which I will get into shortly, because it was great content. But I think what was just mentioned calls for a pause and a little reflection. Two options. Prevent or respond.
As the old saying goes, "An ounce of prevention is worth a pound of cure." So, if we can prevent an issue from happening in the first place, we are in a much better position. And how do we do that? The most logical way (to the engineer in me) is to track slight changes across my system and, as something bubbles up, respond to it quickly.
But, Chris, isn't that what we've tried to do with monitoring tools from the start? Sort of. The problem with traditional monitoring is that you rely on human-set thresholds: someone has to decide what to measure, and a certain level has to be reached before you are alerted to it. This was done because of natural limitations on technology at the time. And that is the beauty of what artificial intelligence offers: freedom from that natural limitation. One might argue that the most natural application of AI to IT Operations is to track all data points for the slightest change, evaluate that change, and alert when necessary.
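To make the contrast concrete, here is a minimal Python sketch that compares a human-set threshold with a check against the metric's own recent history. It is only an illustration under simple assumptions (a rolling z-score over a short window), not how Sophie or any particular monitoring product works; the function names, the sample CPU readings, the window size, and the limit of 3.0 are all made up for this example.

```python
from statistics import mean, stdev

def static_threshold_alerts(values, threshold):
    """Return indices where a human-set threshold is exceeded."""
    return [i for i, v in enumerate(values) if v > threshold]

def baseline_deviation_alerts(values, window=5, z_limit=3.0):
    """Return indices where a value deviates sharply from its own recent history.

    Hypothetical helper: compares each value with the mean and standard
    deviation of the previous `window` values (a rolling z-score).
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all counts as a deviation.
            if values[i] != mu:
                alerts.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts

# Illustrative CPU readings (percent): a jump at index 7, saturation at index 11.
cpu = [20, 21, 19, 20, 22, 21, 20, 45, 47, 46, 48, 91]

static_alerts = static_threshold_alerts(cpu, threshold=90)
baseline_alerts = baseline_deviation_alerts(cpu)
print("static threshold:", static_alerts)
print("baseline deviation:", baseline_alerts)
```

With this sample data, the baseline check flags the jump at index 7, four readings before the 90% threshold is crossed at index 11. A real system would need a more robust baseline (seasonality, noisy metrics), so treat this as a sketch of the idea only.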
The focus should be on data points that are leading rather than lagging. Lagging data points are metrics such as CPU utilization that come from monitoring systems. A great example of leading data is logs. The volume of information delivered and the ability to indicate the slightest change in behavior, as prescribed by the developers or creators of the application, device, virtual machine, etc., is unparalleled when it comes to logs.
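One hedged way to picture logs as a leading signal: watch for message shapes that have never appeared before. The Python sketch below collapses numbers and hex ids into placeholders and reports templates in recent logs that are absent from a baseline. The function names and log lines are hypothetical, and this is a deliberately simple stand-in for what a log analytics engine does, not a description of Loom's method.

```python
import re

def template_of(line):
    """Collapse variable parts so similar messages share one template."""
    # Replace hex ids first, otherwise the digit rule would split them up.
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

def new_templates(baseline_lines, recent_lines):
    """Return templates seen in recent logs but never in the baseline, in order."""
    known = {template_of(line) for line in baseline_lines}
    found = []
    for line in recent_lines:
        template = template_of(line)
        if template not in known and template not in found:
            found.append(template)
    return found

# Illustrative log lines (made up for this example).
baseline = [
    "request 101 served in 12 ms",
    "request 102 served in 15 ms",
]
recent = [
    "request 103 served in 11 ms",
    "connection pool exhausted after 30 retries",
    "connection pool exhausted after 31 retries",
]

unseen = new_templates(baseline, recent)
print(unseen)
```

Here the familiar "request served" messages are ignored, while the never-before-seen "connection pool exhausted" message surfaces once, even though no metric threshold has necessarily been crossed yet.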
Catastrophe - The Post Incident Review
The presenter made the point that all systems involve humans as well as machines, and that how humans respond when things go wrong is as important as preventing things from going wrong in the first place. Ultimately, this relates to a retrospective, or post incident review.
For this we follow 3 key points:
- For every significant incident we perform a post incident review.
- We do the review within 24 to 36 hours of the resolution (so it is fresh in our minds).
- We strive to learn and improve.
The great part about this presentation, for a pseudo-techie such as myself, was that it was a good mixture of theory and technical application. In other words, the presenter dove into the tools that could be used in the Microsoft world to facilitate a post incident review. I'm not going to go in depth on that, as you can view the presentation I linked to above, other than to say that there was still a fair amount of manual correlation and grabbing of data elements to build a repository of information about the incident.
From an AIOps perspective, that level of work shouldn't be necessary. For instance, our AI engine, Sophie, automatically makes those linkages across different data points, bringing the graphs together to show how all of the different points build into one story. The grabbing of data elements is done already, and you are tipped off well before there is actually an incident logged. That means, in the end, you could close out the incident AND conduct a post incident review in the same amount of time it would have taken you to detect the incident in a world without AIOps.
Final Thoughts on Microsoft Ignite
As someone who has been in tech for a while, I am amazed at the transformation I have seen in Microsoft over the past 15 years. They've generally done a good job of packaging content in such a way as to help practitioners, but they are now delivering tools and training at a whole other level.
If you have the opportunity to attend one of the Microsoft Ignite The Tour events, I strongly encourage you to attend. You'll walk away with practical, hands-on tips for doing your job better and faster.
If you want to focus on both the prevention of and response to incidents, I highly encourage you to register for a no-obligation demo with Loom Systems.
Loom Systems delivers an AIOps-powered log analytics solution, Sophie, to predict and prevent problems in the digital business. Loom collects logs and metrics from the entire IT stack, continually monitors them, and gives a heads-up when something is likely to deviate from the norm. When it does, Loom sends out an alert and recommended resolution so DevOps and IT managers can proactively attend to the issue before anything goes down.
Get Started with AIOps Today!