History of SRE: From 2003 to Plans for 2021 and Beyond
SRE is what happens when you ask a software engineer to design an operations team.
Benjamin Treynor Sloss, VP of engineering at Google
The abbreviations “SRE” and “DevOps” keep stirring debate among experts. Although site reliability engineering (SRE) was first mentioned in 2003 and is heavily used by IT giants like Google, Netflix, Amazon, Dropbox, Alibaba, Meituan and Tencent, it is often considered a brand-new, if not unexplored, approach. On top of that, many people do not see a clear difference between SRE and DevOps. So we decided to sort things out and discuss SRE: what it means, its origin, main principles, best practices, how it differs from DevOps, and what SRE engineers do. We will also look at where SRE stands in 2021 and at the plans for its future.
Who invented SRE and why
In 2003, Benjamin Treynor Sloss, a software engineer at Google who coined the term “SRE”, was put in charge of a team of seven engineers. The purpose of this team was to ensure that Google websites were as available, stable, and failure-proof as expected. Since Benjamin was a developer himself, he organized the team's work so that engineers spent 50% of their time on operations tasks, usually manual ones. He wanted them to experience how their products perform in production.
Treynor later explained that a major bottleneck in the software development lifecycle (SDLC) was that development and IT operations teams pursued different goals and worked in isolation on one and the same product. The development team builds new features and watches how customers adopt them. IT operations specialists, by contrast, focus on keeping the service reliable. Because each team had its own objectives, it was hard, if not impossible, to fully achieve business goals. Treynor changed that by bridging the gap between software developers and system administrators, bringing a developer's mindset into operations.
SRE is … when dev is all about system’s reliability
SRE is an approach that applies aspects of software development to operations in order to create and maintain an extremely reliable software system. Although Google originally created SRE, Netflix has taken the practice to a new level: with only 10 SRE specialists, it handles operation and maintenance (O&M) for services with thousands of microservice instances in about 200 countries. Other recognizable brands such as Amazon, Reddit, Alibaba, Meituan and Tencent have eagerly adopted this approach and set up SRE teams. This doesn’t mean that only cloud-based and SaaS companies have adopted SRE, though. Companies of all kinds, whether on-premises, cloud-hosted, or hybrid, are progressively embracing it for their software production teams.
Principles of SRE
- The main principle of SRE is that software development approaches can and should be applied when resolving operational issues in prod.
- Another important SRE principle is to set service level objectives (SLOs) for each service. SLOs are built on service level indicators (SLIs), the metrics you actually measure, and may back a service level agreement (SLA) with customers, so that you can track and measure your performance.
- And finally, the principle of the “error budget.” SRE engineers and managers verify code quality and use SLOs to measure how app changes affect performance. They also define a threshold for tolerable downtime, the error budget. If the downtime caused by a change stays within the error budget, leaders approve the change. If not, the change has to be rolled back and improved until it fits within the error budget. You can calculate the error budget here.
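As a rough illustration of the SLI, SLO, and error budget principles above, here is a minimal Python sketch. The names `availability_sli` and `ErrorBudget`, the 30-day window, and the 99.9% target are assumptions chosen for the example, not values from any particular company's practice.

```python
from dataclasses import dataclass


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI (hypothetical definition): fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


@dataclass
class ErrorBudget:
    slo: float             # target availability, e.g. 0.999
    period_minutes: float  # length of the measurement window

    @property
    def total_minutes(self) -> float:
        """Downtime the SLO tolerates over the whole window."""
        return (1.0 - self.slo) * self.period_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime already observed."""
        return self.total_minutes - downtime_minutes

    def can_release(self, downtime_minutes: float) -> bool:
        """Simple gate: allow changes only while some budget is left."""
        return self.remaining_minutes(downtime_minutes) > 0


# Assumed example: 99.9% availability target over a 30-day window.
budget = ErrorBudget(slo=0.999, period_minutes=30 * 24 * 60)
print(round(budget.total_minutes, 1))            # 43.2 minutes of tolerable downtime
print(budget.can_release(downtime_minutes=20))   # True: budget not yet spent
print(budget.can_release(downtime_minutes=50))   # False: budget exhausted
```

In this sketch the gate is a plain yes/no check. Real error budget policies are usually agreed between teams and can be more nuanced, for example by slowing releases as the budget runs low.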
Wait a minute! Is SRE DevOps, and is DevOps SRE?
Eighteen years ago, Google engineers identified a major blocker in the software production cycle: two key teams working on one product were pulling in opposite directions. Developers aimed to create new features and push them to production as often as possible, while IT operations specialists focused on keeping production stable and reliable. Add different backgrounds and skill sets, and you get the picture: a slow, inefficient process and constant finger-pointing whenever an issue came up. Having identified the problem, Google changed how it ran production by creating a team of system administrators with a development background and mindset, called SREs (site reliability engineers). SREs focus on keeping the balance between pushing new features to production and preserving its stability and reliability.
So, now let me ask you this: how can SRE be DevOps and DevOps be SRE? Yes, both methodologies exist to eliminate bottlenecks in software development and delivery. However, DevOps is mostly about streamlining, optimizing, and automating the SDLC by improving collaboration between teams and using particular practices and tools. DevOps is not about developers working as system administrators; everyone keeps their role. DevOps is more about rolling out new features as often as possible through well-automated CI/CD pipelines than about making sure new updates are resilient enough not to harm production reliability and availability.
As you can see, SRE and DevOps are different, although some experts disagree with that statement. Either way, being different doesn’t mean they cannot complement each other, or be used separately in different situations, within one organization.
SRE engineers – Who are they and what do they do?
While IT operations specialists focus on running the infrastructure they are given by developers and troubleshooting every incident, and DevOps engineers focus on automating different Ops aspects to cut the number of failures, SREs concentrate on planning, designing, and updating an infrastructure that is resilient from the start.
The main job responsibilities of SREs are:
- Collect requirements from stakeholders together with BAs and PMs
- Plan the high-level design of the infrastructure, including tools and flows
- Conduct a thorough analysis of risks and capacity
- Estimate the potential cost of outages and plan for contingencies
- Monitor and inspect the infrastructure in prod
- Prepare reports for the team on updates to infrastructure, flows, tools, etc.
- Work with teams and teach them to follow certain rules that cut down the number of failures
This list could be much longer and depends heavily on the specifics of your organization. By the way, Google experts have written not one but two books devoted to SRE and the roles and responsibilities of site reliability engineers, which you can find here for free.
5 top predictions for SRE
Without further delay, prediction № 1: a massive hiring wave for SREs is coming. It actually started in 2019-20 but will definitely continue in 2021 and beyond; according to LinkedIn, demand for SREs has already grown by 34%.
Prediction № 2 follows with ironclad logic: adoption of SRE will only continue to grow. As with adopting anything new, it should be thoroughly studied and planned. There is no one-size-fits-all solution, so each organization’s leaders have two options: either go through piles of information and spend hours, days, or weeks (we could go on) on planning and hiring SREs, or turn to outsourcing companies that provide SRE services. The second option is one of the best choices for SMBs and startups that lack skilled professionals, and time for that matter. Either way, it will not hurt to consult experienced SRE providers to answer your burning questions.
Prediction № 3: practice and cultural shift will be prioritized in 2021 and beyond, which means that SRE and DevOps will have to join forces.
Prediction № 4 warns us that more people (not only SREs) will focus on the reliability of infrastructure, which means that the reliability and availability mindset will enter all the phases of SDLC.
And finally, prediction № 5 tells us that SLIs, SLOs, and error budget policies will become routine procedures. If SRE history is described in ages, as in Björn Rabenstein’s “SRE in the Third Age” theory, SRE will soon enter the third age, where we will no longer need dedicated reliability engineers. Instead, we will need SRE as a general rule for any SDLC!
Final thoughts: Do we need SRE?
To wrap things up, we would like to list top SRE benefits for the business. SRE helps to:
- Meet and often exceed customers’ expectations for functionality and beyond
- Increase the reliability and availability of the systems
- Reduce failure rates and downtime
- Prevent bugs/errors and quickly restore systems if something does go wrong
- Reach production goals quickly and more efficiently
- Automate processes for reliability to save time for teams
- Raise the level of service guarantees
- Prevent conflicts that might arise
Whether you need SRE or not is up to you, but to answer this question you should at least know what you are dealing with. We hope this article has clarified some points about site reliability engineering that you might not have been aware of. If you have more questions, you can contact us anytime for a consultation on any SRE/DevOps-related issue you have.