Marketing leader Marketo had a challenging few days last week after failing to renew its domain name. A lot has been said about the recent Marketo outage (#marketodown #marketopocalypse), from the hilarious to the critical to the supportive. I have nothing extraordinary to add, but I want to tie it to the series of outages we covered lately on our blog (BA & Gitlab) and offer you some takeaways!
So, in a nutshell, what happened?
On July 25, Marketo’s website was down from 7:41 am EST (4:41 am PST) through at least 9 pm EST (6 pm PST). During the next 48 hours, some users still couldn’t access the website.
Fortunately (or sadly, if you’re Marketo’s CEO), one of Marketo’s customers, Travis Prebble, a domain name specialist, quickly figured out the problem: the company had somehow failed to renew its domain, and the registration had expired. Travis acted quickly and paid the registration and reinstatement fees, resolving the cause of the issue. In our business, having an end user diagnose and fix the root cause of an issue is a NO-NO!
Gradually Marketo’s services came back online as the change propagated through the domain name system.
Now for the takeaways:
1) Transparent and Timely Communication
To be fair, Marketo’s CEO, Steve Lucas, bravely led the company’s communication at this sensitive time. Not at first, but over the course of the day, he provided timely updates that reassured customers the problem was the company’s highest priority. In one of his tweets, Steve Lucas wrote: “Resolving DNS issues
2) MTTR: the Only Option is a QUICK One
Marketo is used by more than 9,000 domains, which means that thousands of marketers were struggling (and failing) to do their jobs. As with all software platforms, minimal downtime, or no downtime at all, is critical to
3) Automation is Your Friend - Data Silo is Your Enemy
The tech world has become so complex that manual system management just doesn’t cut it anymore. When an organization has thousands of assets to monitor and only a limited number of staff members to monitor them, mistakes will happen. Automation, together with the ability to correlate issues across different IT assets such as applications, infrastructure, middleware and virtualization layers, has an enormous impact on the continuity of any business.
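To make the point concrete, here is a minimal sketch in Python of the kind of automated check that could flag a domain registration before it lapses. It assumes you already keep an inventory that maps each domain to its registration expiry date. The function name `domains_needing_renewal`, the 30-day warning window, and the sample domains are all hypothetical; none of this is taken from Marketo's or Loom's actual tooling.

```python
from datetime import date, timedelta


def domains_needing_renewal(inventory, today, warn_days=30):
    """Return (domain, days_left) pairs for domains expiring within warn_days.

    inventory: dict mapping domain name -> expiry date (hypothetical inventory).
    Already-expired domains are included with a negative days_left.
    Results are sorted so the most urgent domain comes first.
    """
    cutoff = today + timedelta(days=warn_days)
    flagged = [
        (domain, (expires - today).days)
        for domain, expires in inventory.items()
        if expires <= cutoff
    ]
    return sorted(flagged, key=lambda item: item[1])


if __name__ == "__main__":
    # Illustrative data only; these are not real registration records.
    sample = {
        "example.com": date(2017, 7, 25),
        "example.org": date(2017, 8, 10),
        "example.net": date(2018, 3, 1),
    }
    for domain, days_left in domains_needing_renewal(sample, today=date(2017, 7, 20)):
        print(f"{domain}: {days_left} days until expiry")
```

Run on a schedule and wired into an alerting channel, a check like this turns a renewal date from something a person has to remember into something a system refuses to forget.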
4) If You’re Big and Complex Enough, it Will Happen to You
Large-scale systems with a massive number of moving parts are going to suffer from “black swan” events: something terrible that happens unexpectedly, often triggered by something relatively minor, but with disastrous ripple effects. Companies such as Google, Amazon, and Netflix have also suffered outages and other disasters, showing that it can happen even to the best-of-breed. You can (and should) put processes and technologies in place to contain the impact of these events at scale.
In conclusion, it looks like Marketo has weathered the storm and might come out of it even stronger. Not many companies have brand equity strong enough to carry them through such a disaster. Even though the downtime had a large negative impact on customer operations, the Marketo community showed the brand overwhelming support and rooted for the company to get back up and running. Ultimately, it’s how you handle the crisis once it has happened that will have the lasting effect.
Loom Systems delivers an AIOps-powered log analytics solution, Sophie, to predict and prevent problems in the digital business. Loom collects logs and metrics from the entire IT stack, continually monitors them, and gives a heads-up when something is likely to deviate from the norm. When it does, Loom sends out an alert and a recommended resolution so DevOps and IT managers can proactively attend to the issue before anything goes down.