We’re going to take a wild ride, 11 years back.
The year is 2006, and I was working as an analyst in the tech-ops division of a call center. My job was to monitor the network for funky things and make sure every one of the 600+ call center reps was doing their job. Each project had different acceptable thresholds, and each rep had a specific job.
My role basically meant that I was constantly in fight-mode. Make sure the projects are running, since the company's SLA depends on it. If not, we have to pay the customer for not being online to take calls. If there is a network issue, escalate it to the devops team (that's what they were back then, even though we just called them tier 3 support) so they can fix it now, or mark it down as a problem not within our control so we won't be billed for it.
I had a job with purpose. I walked the floor like I owned the place.
My first #oncallselfie in 2006
I knew that the business was dependent on what I did, and how well our team performed. Our team allowed all of the different call center projects to continue to produce revenue. Our job was to keep the business pulse beating.
In addition to the above, we also had to be in direct contact with our partners all over the world if we needed to be backed up or have additional resources.
One splendid Monday morning we came in to work as usual, only to find out that we couldn't connect to our remote systems. We knew about some routine maintenance over the weekend, but that should have been resolved already.
After a short check, we realized that we were practically at DEFCON 1. We were in an outage unrelated to the maintenance, and someone had physically cut the line somewhere in New Jersey.
This outage meant that we had to call the main NOC line and inform them of what was happening. This also meant that our main project of 250+ employees could not work. The first shift of 70+ employees was on standby waiting for this to be fixed. We thought, sure, we can get this up and running in an hour or so. Just wait, and we’ll route the calls to one of our affiliates in Bangalore or Makati, as the USA (MST) sites haven’t woken up yet.
While on this call, things went from bad to worse. We discovered that they knew there was a disconnect, but not where. Troubleshooting would take more time than we thought. So this conference call with all of the oncall techs continued.
For 4 days.
My shift started at 7AM EST on Monday, and the shift manager for the evening shift started at around 3PM. So when I finished my shift, and we still didn’t know what was going on, he took over and tried to see where things stood. It was like that for 3 days until they detected the location of the problem.
On days 2 & 3 we called the employees to make sure they didn't come in. There was no reason to have them come in to the office if we didn't know the root cause of the problem. On day 4 they found the issue and dispatched the right people to fix the cut cable that had crippled us and our team. Things were starting to look bright again. The problem was now fixed and the business had only lost northwards of $300K in revenue. Meh.
Granted, this was 11 years ago, and an issue like this would probably take about an hour to fix nowadays. We didn't have the most sophisticated systems. PagerDuty wasn’t even a twinkle in its founders' eyes (trust me, I checked. Don't believe me?).
Not enough people owning too much - if no one is responsible, no one is accountable.
Systems visibility - if you can’t see enough of what the problem is, you will continue to get paged when things break.
Team visibility - if the higher-ups don't know about the problems, how can they fix them at a higher level?
Band-aids on bullet holes - if you constantly put band-aid upon band-aid, you won't ever get to the root of the problem, and it will keep happening. Remedy the issue at the source.
Notification clean-up: Actionable alerts to give you a full picture.
Cluster alerts: No need to have 50 alerts. Just tell me once what the problem is.
Reasonable SLAs: No human can find/fix an issue in minutes. So either fix your SLA, or have robots do the work.
Devs on-call: If the Dev is on call, they are your only chance for fixing the problem in real-time.
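To make the "cluster alerts" idea a bit more concrete, here is a minimal sketch in Python of what "just tell me once" could look like. It is not how any particular product does it. The `Alert` fields, the `cluster_alerts` and `summarize` names, and the 5-minute window are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # hypothetical: the failing component, e.g. "nj-link"
    message: str


def cluster_alerts(alerts, window_seconds=300):
    """Group alerts from the same source that arrive within
    window_seconds of the first alert in that source's cluster."""
    clusters = []
    latest = {}  # source -> index of that source's most recent cluster
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = latest.get(alert.source)
        if idx is not None and alert.timestamp - clusters[idx][0].timestamp <= window_seconds:
            clusters[idx].append(alert)
        else:
            # First alert for this source, or the window expired: start a new cluster.
            clusters.append([alert])
            latest[alert.source] = len(clusters) - 1
    return clusters


def summarize(cluster):
    """One line per cluster instead of one page per alert."""
    first = cluster[0]
    return f"{first.source}: {first.message} ({len(cluster)} alerts)"
```

Feed it 50 "link down" alerts from the same source inside a few minutes and you get a single cluster back, so the person on call gets one page with a count instead of 50 separate ones. A real system would also want to correlate across sources, which this sketch does not attempt.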
Another option is to use robots, as Alice already mentioned, to clean up the notifications and have actionable alerts that are clustered and correlated, including the root cause of the issue, in real time. Loom Systems can do exactly that, so you can get paged and be back to sleep in 3 minutes (after you’ve determined that it’s not your fault).
That just leaves you with having a Dev on call. Good luck with that.