

No Excuses: 3 Top Best Practices for SREs

Understanding the context

Ben Treynor, a site reliability tsar at Google, described SRE as “what happens when a software engineer is tasked with what used to be called operations.” It is a discipline aimed at building ultra-scalable and highly reliable software systems. Think of it as a set of practices that act as a backbone for the company, keeping customer satisfaction at the core of its operational philosophy.


We aren’t only talking about optimizing latency or performance here. SRE is about being equipped with the right tools, techniques and strategies. So let’s look at some of its core factors to see what makes a great SRE.

System Performance = Customer Success?

Let’s talk about customers first. Have you ever wondered whether Samsung would have become a prodigy of the tech world if it had never launched the Galaxy series? What happened behind the curtains has more to do with how the company stood firm for its customers by putting real muscle into the quality of the smartphones it makes every year. We are talking about a whole different operational style, one that ensures every operation leaves a bold mark on customers and their satisfaction.

Companies like Dropbox and Netflix that sell products or services to customers have functions responsible for managing customer fulfillment: an optional relationship between the vendor and customers. By optional, we mean that the team takes ownership of every aspect of product development.


“We are responsible to deploy our code, monitor it, be on call, and put water on fire when required. We do it all ourselves as if there is no SRE or we are our own SREs” (Google Engineer).

Customer success is linked with system performance. The primary concern is to improve the user experience in order to decrease costs and raise revenues, so you need to focus on making the system as efficient as you can.

Consider working on an SRE team at a search engine company that is trying to tackle bots causing a severe deterioration in site search results. Wouldn’t you focus deeply on a cost-effective security algorithm that can deal with thousands of bots every day without crashing?

We are talking about how well your system has to behave to keep the company from draining massive amounts of money on something that wouldn’t benefit your users! Knowing that millions of web hosts are connected through your search engine, you make it fault-tolerant and keep improving its performance, and in doing so build the “trust” that lets customers rely on your services in the future.

Performance metrics provide great feedback on how well customers are supported and what they demand. But it is still important to know when users are affected.

The figure on the left shows some of the funniest reactions people had when the Note 7 issue came out. So always know when it’s broken. Metrics alone are not always enough to ensure customer success.
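One way to know when it is broken is to measure what users actually experience rather than only machine-level metrics. The sketch below is illustrative: `Request`, `user_impact`, the 1-second slowness cutoff and the 5% alert threshold are all assumptions made for this example, not values from any particular monitoring tool. A request counts as bad when the user received a server error or waited too long.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # how long the user waited for the response


def user_impact(requests: Sequence[Request], slow_ms: float = 1000.0) -> float:
    """Fraction of requests where the user saw a server error or a slow response."""
    if not requests:
        return 0.0
    bad = sum(1 for r in requests if r.status >= 500 or r.latency_ms > slow_ms)
    return bad / len(requests)


def users_affected(requests: Sequence[Request], max_bad_fraction: float = 0.05) -> bool:
    """True when the share of bad requests exceeds the (assumed) alert threshold."""
    return user_impact(requests) > max_bad_fraction


# Hypothetical sample: one slow response and one server error out of four requests.
sample = [
    Request(200, 120.0),
    Request(200, 1500.0),
    Request(503, 80.0),
    Request(200, 90.0),
]

if __name__ == "__main__":
    print(f"bad fraction: {user_impact(sample):.0%}, alert: {users_affected(sample)}")
```

Counting slow responses as failures is a deliberate choice here: a page that takes too long is broken from the user's point of view even if the server reports success.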

Everybody can make a mistake.

When you are working in a team, a disaster can occur at any time. What if someone quits? What if their technology crashes? Failures like these can easily be blamed on another member, but what if you yourself deployed some code or made an unwise decision that breaks the whole system, scrambling your output into a state you can’t get out of?

To make sure this doesn’t happen, our work should be organized into clusters. The aim is to find the most powerful opportunities for improvement at every iteration. This means making the best of the time we have and using our resources in the most effective way to identify the factors that put us at risk.

Note that clusters:

  • Are easily manageable
  • Add value to work performance
  • Manage themselves by identifying the right expertise.

Assume you are working on a client’s site and he has given you access to a hosting service with 5 domains. You decide to delete all the server files to reinstall WordPress. As a result, all his other email and domain files are deleted and his domains suddenly disappear. You also forgot to make backups. This is the kind of situation that could land you in serious trouble.

So wouldn’t it be wise to work in multiple chunks and deploy gradual changes instead? If you had a team with each member working on some module, it would have been so much easier.
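For the WordPress scenario above, one way to make a destructive change safer is to back up first, verify the copy, and only then delete. The following is a minimal sketch, assuming the files live in a local directory. `backup_then_delete` and the paths used in the demo are hypothetical names for illustration, not part of any hosting tool.

```python
import shutil
import tempfile
from pathlib import Path


def backup_then_delete(target: Path, backup_root: Path) -> Path:
    """Copy `target` into `backup_root`, verify the copy, and only then delete `target`."""
    backup = backup_root / f"{target.name}.bak"
    shutil.copytree(target, backup)  # fails if the backup already exists

    # Compare the relative paths of everything in the original and in the copy.
    original = sorted(p.relative_to(target) for p in target.rglob("*"))
    copied = sorted(p.relative_to(backup) for p in backup.rglob("*"))
    if original != copied:
        raise RuntimeError("backup incomplete; refusing to delete")

    shutil.rmtree(target)
    return backup


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "site"
        (site / "mail").mkdir(parents=True)
        (site / "mail" / "inbox.txt").write_text("client email")
        saved = backup_then_delete(site, Path(tmp) / "backups")
        print("site removed:", not site.exists())
        print("backup holds:", [p.name for p in saved.rglob("*")])
```

The check only compares file and directory names, not contents, so a real procedure would likely add checksums and store the backup on a different machine.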

Mitigate the Risk!!!

SRE entails developing contingency plans to address risk. Ever-changing technology can leave a team short on training and knowledge, which leads to improper use of new technology and, in turn, to system failure. Changes to user requirements also do not always translate into functional requirements, leading to critical failures in a poorly planned system.

However, through risk management, you can anticipate problems that may occur and formulate contingency plans in the event the risks become real.

You can also use stress testing, where the system is tested under worst-case scenarios to determine how it will hold up in real-world use. This can range from flooding the system with user traffic to injecting aggressive malware.

Say you are part of an SRE team responsible for developing a CMS for an institute. The target audience is its 15k students and faculty. The day before results are released, the CMS will be bombarded by users: faculty uploading test scores and students checking them. It is crucial to understand how easily the system may crash, so test it by increasing the user load and watching its performance to see where it breaks.
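The idea of raising the load step by step until the system breaks can be sketched as follows. This is an illustration under stated assumptions: `find_breaking_point` is a hypothetical helper, and `simulated_cms` is a stand-in model with an assumed capacity of 12,000 users, not a measurement of any real CMS. In practice `run_step` would call a real load generator and return the observed error rate.

```python
from typing import Callable, Iterable, Optional


def find_breaking_point(
    run_step: Callable[[int], float],
    user_steps: Iterable[int],
    max_error_rate: float = 0.01,
) -> Optional[int]:
    """Return the first load level whose error rate exceeds `max_error_rate`, or None."""
    for users in user_steps:
        error_rate = run_step(users)
        print(f"{users:>6} users -> error rate {error_rate:.1%}")
        if error_rate > max_error_rate:
            return users
    return None


def simulated_cms(users: int, capacity: int = 12_000) -> float:
    """Hypothetical model: every request beyond `capacity` fails."""
    failed = max(0, users - capacity)
    return failed / users


if __name__ == "__main__":
    # Ramp from 3k users up past the 15k audience mentioned in the text.
    steps = range(3_000, 18_001, 3_000)
    print("breaking point:", find_breaking_point(simulated_cms, steps))
```

Ramping past the expected peak matters: a test that stops at the expected load only tells you the system survived, not how much headroom is left.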

Hence, regular risk monitoring should be integral to most SRE activities. This means frequent checks during project meetings and critical events, including when publishing project status reports and system logs.

Also make sure the team communicates as much as possible so that all potential blind spots are eliminated.

If you can identify, assess, prioritize, and manage all of the major risks and do something in the present to mitigate them, then your project will have a higher chance of success.
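A common way to prioritize risks is to score each one by probability times impact and handle the highest scores first. The sketch below assumes that scoring scheme, which the text itself does not prescribe. The `Risk` class is hypothetical, and the probabilities and impact values are made-up numbers attached to risks mentioned earlier in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    probability: float  # estimated chance of occurring, 0.0 to 1.0
    impact: int         # estimated severity, 1 (minor) to 5 (critical)

    @property
    def exposure(self) -> float:
        """Simple risk score: probability multiplied by impact."""
        return self.probability * self.impact


def prioritize(risks: List[Risk]) -> List[Risk]:
    """Return the risks ordered from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Illustrative values only; real estimates come from the team's own assessment.
risks = [
    Risk("Team untrained on new technology", 0.4, 3),
    Risk("User requirements not reflected in functional requirements", 0.2, 4),
    Risk("Traffic spike on results day", 0.9, 5),
]

if __name__ == "__main__":
    for risk in prioritize(risks):
        print(f"{risk.exposure:.1f}  {risk.name}")
```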

So, will SREs follow the same path of rapid growth that data scientists did before them?

Yes, they will. This new norm has taken root across the emerging IT stack and has played a crucial role in the success of thousands of companies. Its biggest impact on the tech world is still to come, because it embodies the most important practices for producing the most reliable products.


