SRE

So you want to be a Site Reliability Engineer?

March 22, 2016 | Beerit Goldfarb
So what does it take to be an SRE?

 

Site Reliability Engineering (SRE) is often perceived as either a variation of DevOps methodologies or as a personality quirk of overachieving Systems Administrators. The truth is quite far from both (mis)interpretations.

 

A System Administrator holds an IT operations position, designing, building and maintaining an organization’s computer infrastructure. DevOps is a methodology meant to build a healthy working relationship between the operations and development teams.

 

SREs solve a very basic problem sysadmins and DevOps do not: how to incorporate scalability, reliability and high availability right into software code.

 

Software engineers write code to accomplish a specific task, but they don’t always have scalability in mind. That code should also be able to scale, yet often nobody is in charge of making sure it does. This is where SREs come in.

 

Because sysadmins are often in charge of infrastructure and scaling, site reliability is often regarded as a sysadmin’s responsibility. However, SREs need to be classically trained in computer science in order to make sense of the software code. A vast majority of sysadmins are less formally trained and can’t handle these tasks.

 

SREs spend half their time in dev writing code, and the other half in ops ensuring its reliability. By eliminating scalability issues at the code level (rather than at later stages), SREs are in many ways the perfect combination of dev and ops, accomplishing what DevOps intended.

 

SREs set clear, mathematically modeled service-level agreements (SLAs) that define thresholds for release stability and reliability. They don’t just find issues; they solve them on the spot. As such, they are very well received by people from all departments.
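To make "mathematically modeled" a little more concrete, here is a minimal sketch of how an availability target turns into a downtime allowance. The function names and the 30-day window are assumptions made for illustration, not anything prescribed by a particular SLA.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Downtime a service may accumulate in `window` and still meet `target`.

    `target` is an availability fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return window * (1.0 - target)


def meets_sla(target: float, window: timedelta, observed_downtime: timedelta) -> bool:
    """True if the observed downtime fits inside the allowance."""
    return observed_downtime <= allowed_downtime(target, window)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed reporting window
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} over 30 days allows {allowed_downtime(target, month)}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is the kind of threshold a release can be measured against.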

 

Think you’ve got what it takes to be an SRE?

 

❑ You have a firm understanding of Computer Science fundamentals.
❑ You’re a strong, well-rounded engineer.
❑ You love debugging (and are good at it).
❑ You know your toolkit inside out.
❑ You can read code in your sleep.
❑ You have a firm grasp of complex systems.
❑ You have strong analytical skills and intuition when it comes to solving problems.
❑ You learn from your mistakes, as well as other people’s.
❑ You’re a team player through and through.
❑ You like the adrenaline rush of fast-paced work.

 

If you answered yes to more than half of these, it’s fair to say you’re on the right track. Spend some time honing your computer science fundamentals and gaining experience. Run a website, run a server, and write web apps. Learn as much as you can about IT needs and running servers, then apply that understanding to running a server farm (a Hadoop cluster, for example). Experiment, network with other engineers, create user-facing services and generally just geek out as much as possible.

 

Once you do all this (or if you have already), check out the overview of SRE core concepts below. If more than a couple of these are unfamiliar, start getting acquainted.

 

SRE Core Concepts

 

General:

  • Failure modes, and especially SPOF (single point of failure). Eliminating SPOFs is your greatest challenge – and pleasure – as an SRE.
  • Infrastructure components, from applications to hardware (servers, switches, routers, Internet connectivity, firewalls, ISPs, Internet routing (BGP), IPS systems, etc).
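
Why SPOFs matter so much shows up in some back-of-the-envelope arithmetic. The sketch below assumes components fail independently (a simplification that real systems often violate) and uses hypothetical function names; it compares a chain of components, where each one is a single point of failure, with a set of redundant replicas.

```python
from math import prod
from typing import Iterable


def series_availability(components: Iterable[float]) -> float:
    """Availability of a chain where every component must be up.

    Each component is a single point of failure, so availabilities multiply.
    """
    return prod(components)


def redundant_availability(replica: float, n: int) -> float:
    """Availability when at least one of `n` identical, independent replicas must be up."""
    if n < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - replica) ** n


if __name__ == "__main__":
    # Assumed figures: a router, a firewall and a server, each 99% available.
    print("chain of three:", series_availability([0.99, 0.99, 0.99]))
    print("two redundant servers:", redundant_availability(0.99, 2))
```

Chaining components drags availability below that of the weakest link, while adding a replica pushes it up, which is the numeric case for eliminating SPOFs.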
 

Application level:

  • Application load testing, memory leaks and breaking points.
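
One way to pin down a breaking point from load test results is to look at tail latency per load level. This is a minimal sketch that assumes you already have latency samples (in milliseconds) grouped by concurrent-user count; the data layout, the 95th percentile and the names are illustrative choices, not a standard.

```python
import math
from typing import Dict, List, Optional


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breaking_point(results: Dict[int, List[float]], budget_ms: float,
                   pct: float = 95) -> Optional[int]:
    """Lowest load level whose pct-th percentile latency exceeds `budget_ms`.

    Returns None if every tested level stays within budget.
    """
    for load in sorted(results):
        if percentile(results[load], pct) > budget_ms:
            return load
    return None


if __name__ == "__main__":
    # Made-up measurements: concurrent users -> observed latencies in ms.
    measured = {
        10: [20, 22, 25, 30],
        50: [40, 45, 60, 80],
        100: [90, 150, 400, 900],
    }
    print(breaking_point(measured, budget_ms=200))
```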
 

Server level:

  • High Availability and system failovers. How to make a system fail gracefully, without losing transactions and while remaining stateful from the end user’s perspective.
  • Backup systems.
  • Hard drive reliability and failover (including RAID features). On a data center level, you should consider disaster recovery (ensuring failovers to a different location).
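
At its simplest, a failover is "try the primary, then fall back to a standby". The sketch below is a hypothetical helper, not a production pattern: it assumes each backend is a plain callable and that the operation is safe to repeat (idempotent). Without that property, a retry on a second backend could duplicate a transaction instead of saving it.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Call each backend in order and return the first successful result.

    Raises RuntimeError only if every backend fails.
    """
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # broad on purpose for the sketch; narrow this in real code
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors!r}")


if __name__ == "__main__":
    def primary() -> str:
        raise ConnectionError("primary data center unreachable")

    def standby() -> str:
        return "served from standby"

    print(call_with_failover([primary, standby]))
```

From the caller's side the failure of the primary is invisible, which is the "graceful" part; keeping state consistent between primary and standby is the harder problem this sketch leaves out.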
 

Security & Management:

  • Understanding different types of cyber security attacks.
  • SLAs – saving the best for last, SLAs (service level agreements) are one of the most important aspects of an SRE’s work. Setting, monitoring and enforcing SLAs will take up a large chunk of your work.
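
Monitoring an SLA often comes down to comparing observed failures against what the target permits. Here is a rough sketch that assumes a request-based target and uses a hypothetical function name; real SLAs may be defined over time windows, latency or other measures instead.

```python
def error_budget_remaining(target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the permitted failures still unspent.

    1.0 means no failures so far, 0.0 means the allowance is used up,
    and a negative value means the target has been breached.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed numbers: 99.9% target, one million requests, 400 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 400)
    print(f"{remaining:.0%} of the failure allowance is left")
```

A value heading toward zero is a reasonable trigger for slowing releases; what "enforcing" means beyond that is a policy decision for your team.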
 

About Loom Systems:

Loom Systems Ops is the next-generation AI Operations Analytics solution – the first fully automatic solution to derive insight from unstructured data. Loom Systems Ops acts as your team’s artificially intelligent team member; endlessly patient, it monitors your entire stack 24/7, predicting possible issues and reporting existing ones in clear, plain English.

 

Loom Systems Ops saves you countless hours on monitoring and analysis by weeding out the irrelevant noise, while predicting and notifying you of everything that is of real value to your business success.

 

Want to learn more? Watch Loom Systems Ops in Action.

Tags: SRE IT Operations SysAdmin
