Reliability is not optional; it is a prerequisite for every system. If it is being treated as a luxury, it is time to rethink priorities and move reliability to the top of the list. The importance of reliability, resiliency, security, observability, and scalability gave rise to the term Site Reliability Engineering (SRE), coined at Google in the early 2000s.
Infrastructure complexity keeps growing, and it demands practices that cover not just software development but operations as well.
To build highly scalable and reliable systems, enterprises need practices that combine software development with IT operations, and the outcome is Site Reliability Engineering.
Site Reliability Engineering is therefore the practice of developing, automating, shipping, and maintaining software in production using strategies that blend what development, systems, and operations teams do. The discipline existed long before the term Site Reliability Engineering, or SRE, was coined; before that, the same work was handled by a mix of traditional IT engineering and operations practices, much of which is now associated with DevOps.
It involves balancing the delivery of new features against the reliability requirements of what already runs in production. As systems gain functionality, they need to be managed through solutions that stay scalable and manageable across hundreds or thousands of complex systems.
All in all, SRE practices help achieve reliability goals while adopting newer methodologies in the paradigm shift from traditional systems to cloud-native systems.
Site reliability engineering helps organizations decide which new features can be launched, and when, by expressing the required reliability of the system as service-level indicators (SLIs) and service-level objectives (SLOs), which in turn back the service-level agreements (SLAs) made with customers.
A Site Reliability Engineer is the person who practices the principles of Site Reliability Engineering. A Site Reliability Engineer is responsible for ensuring that the code is correctly:
- Deployed
- Configured
- Monitored
- Observed
- Managed
- Safeguarded
These checks help achieve proper availability, latency, system management, change management, capacity management, and emergency response as systems become production-grade, or production-ready.
The following activities are carried out by Site Reliability Engineers as part of their overall goals:
Disaster Management:
Outages and downtime are a routine part of an SRE's work. The hard part is ensuring that outages are handled quickly and that there is proper preparation for potential disasters before they happen. SREs need to put a disaster recovery plan in place to limit the damage. Identifying disaster mitigation strategies and automating as much of the recovery process as possible are crucial.
Incident response:
Production incidents are common and can occur in any application at any time. As new features and newer methodologies are introduced into an application, the risk of incidents grows, which can become a concern when addressing client requirements. What is essential for an SRE is to minimize the impact of production incidents.
The following metrics are used to measure the speed and efficiency of incident response:
- Mean time to detect (MTTD), which measures the average time needed to discover a problem
- Mean time to resolve (MTTR), which measures how long it takes to fix a failed system
- Mean time to failure (MTTF), which is the average amount of time a non-repairable system or component runs before it fails; this is similar to uptime and helps teams plan the replacement of system components before they stop working
- Mean time between failures (MTBF), which measures the average time a repairable system or component works properly between one failure and the next
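As a rough illustration, the incident metrics above can be computed from a log of incident timestamps. The sketch below is a minimal example, not a standard implementation: the `Incident` record and its field names are hypothetical, MTTR is assumed to run from the start of the fault to its resolution (some teams start the clock at detection instead), and MTBF is taken as the average working time between the end of one incident and the start of the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: fault start until detection."""
    return mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: fault start until resolution (an assumption)."""
    return mean([i.resolved - i.started for i in incidents])


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean working time between the end of one incident and the next start."""
    ordered = sorted(incidents, key=lambda i: i.started)
    if len(ordered) < 2:
        raise ValueError("MTBF needs at least two incidents")
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Made-up sample data, for illustration only.
log = [
    Incident(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 10), datetime(2024, 1, 1, 1, 0)),
    Incident(datetime(2024, 1, 11, 0, 0), datetime(2024, 1, 11, 0, 20), datetime(2024, 1, 11, 2, 0)),
    Incident(datetime(2024, 1, 31, 0, 0), datetime(2024, 1, 31, 0, 30), datetime(2024, 1, 31, 3, 0)),
]
print("MTTD:", mttd(log))
print("MTTR:", mttr(log))
print("MTBF:", mtbf(log))
```

Tracking these values over time, rather than looking at a single snapshot, is what shows whether incident response is actually improving.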
Maintaining SLAs, SLOs & SLIs:
To achieve the desired reliability and maintain high standards, it is key to define Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs), and to keep meeting them. They help an enterprise understand a client's requirements better and create a systematic plan for the eventual deliverables. To understand how SLAs, SLOs, and SLIs drive SRE best practices, let us define them first.
A Service Level Agreement (SLA) is the agreement or contract between the provider and the client. It lists the metrics the provider promises to meet and the consequences if the provider fails to meet them. If the client's expectations aren’t met, the potential consequences include penalties, extensions, service credits, etc.
A Service Level Objective (SLO) is an individual objective or promise made to the client under the SLA. SLOs cover specific metrics such as uptime or response time, each with a target value or range of values that defines what success looks like for the client. They are driven mostly by customer requirements rather than by present performance.
A Service Level Indicator (SLI) is the quantitative measurement behind an objective defined in an SLO. If an SLO sets a target for response time, then the measured response time is the SLI. An SLI therefore reflects what users actually experience and pins down exactly what needs to be measured.
Thus, defining and maintaining SLAs, SLOs, and SLIs is the foundation for defining other major factors, such as error budgets and external factors, and for meeting Site Reliability best practices.
Formulating an error budget:
To leave the team room for agility, one needs to accept failure as inevitable. An error budget is derived from an SLO; it not only supports innovation and risk mitigation but also identifies how much error your services can incur before clients become dissatisfied.
An error budget is defined as the maximum amount of time that infrastructure can fail without contractual consequences. To calculate the error budget, start from the SLI equation: SLI = [Good events / Valid events] x 100
The SLI is thus expressed as a percentage. The target set for each SLI is its service-level objective (SLO), and the error budget is the remainder: 100% minus the SLO. For example, an SLO of 99.9% leaves an error budget of 0.1%.
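To make the relationship between SLI, SLO, and error budget concrete, here is a minimal sketch of the arithmetic. The function names and the request counts are illustrative assumptions, not part of any particular monitoring tool; the only formula taken from the text is SLI = good events / valid events x 100, with the error budget as the remainder up to 100.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events * 100


def error_budget_percent(slo_percent: float) -> float:
    """The error budget is whatever the SLO leaves over, up to 100%."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return 100 - slo_percent


def budget_remaining(good_events: int, valid_events: int, slo_percent: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = valid_events * error_budget_percent(slo_percent) / 100
    actual_bad = valid_events - good_events
    return 1 - actual_bad / allowed_bad


# Illustrative numbers: 1,000,000 valid requests, 999,500 of them good,
# measured against a 99.9% SLO.
print("SLI (%):", sli_percent(999_500, 1_000_000))
print("Error budget (%):", error_budget_percent(99.9))
print("Budget remaining:", budget_remaining(999_500, 1_000_000, 99.9))
```

With these sample numbers the service has used about half of its budget, which is the kind of signal teams use to decide whether there is still room to ship risky changes.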
For example, if your Service Level Agreement (SLA) specifies that the systems will function 99.99% of the time before the business has to compensate clients for an outage, the error budget (the time your systems can be down without consequences) is 0.01% of the year, or roughly 52 minutes and 35 seconds (taking a year as 365.25 days).
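The yearly figure quoted above can be reproduced with a few lines of arithmetic. This is a small sketch under one stated assumption: the year is taken as 365.25 days, which is what yields roughly 52 minutes and 35 seconds. The helper name `allowed_downtime` is made up for this example.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period * ((100 - availability_percent) / 100)


YEAR = timedelta(days=365.25)  # assumption: average year length including leap years

for target in (99.0, 99.9, 99.99):
    print(f"{target}% over a year ->", allowed_downtime(target, YEAR))
```

The same function works for shorter windows, for example `allowed_downtime(99.9, timedelta(days=30))` for a 30-day budget, which is often more practical to track than a yearly one.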
Efficient Planning:
As a system goes through constant change and growing complexity, organizations need to anticipate growth, spikes, and possible faults, and prepare for them. An effective planning mechanism needs to be in place to retire old and worn-out software, ensure quality and stability, conduct timely security and version checks, and cater to dependency needs. To prepare for these events, SREs need to forecast demand and plan time for reaction. Vital facets of capacity planning include regular load testing and accurate provisioning. Conducting load tests regularly allows teams to see how the system behaves under various usage scenarios. Also, adding capacity can be expensive, so knowing where additional resources are needed is the key to planning.
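One simple way to picture demand forecasting and provisioning is shown below. This is only a sketch under loud assumptions: demand is extrapolated with a plain least-squares straight line (real traffic is rarely that tidy), the per-instance capacity is assumed to come from a load test, and the 30% headroom default is an arbitrary illustrative value. The function names are hypothetical.

```python
import math


def forecast_peak(history: list[float], periods_ahead: int) -> float:
    """Extrapolate past peak load with a least-squares straight line."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


def instances_needed(peak_load: float, per_instance_capacity: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak_load with spare headroom for spikes."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    return math.ceil(peak_load * (1 + headroom) / per_instance_capacity)


# Illustrative data: monthly peak requests per second, and a load test
# suggesting one instance handles about 50 requests per second.
monthly_peaks = [100, 120, 140, 160]
expected = forecast_peak(monthly_peaks, periods_ahead=2)
print("Forecast peak in two months:", expected)
print("Instances to provision:", instances_needed(expected, per_instance_capacity=50))
```

Comparing the forecast against what is currently provisioned shows where additional resources are actually needed, which helps avoid paying for capacity that will sit idle.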
Addressing past issues:
Issues that affected the system in the past can reemerge and impact it again. Thus, it is crucial to address the whys, hows, and whens of previous issues, especially those that caused drastic outages.
Site Reliability Engineering is a must for every organization looking to adopt reliability engineering practices and enhance its DevOps outlook. People often confuse SRE with DevOps; SRE can be seen as a subset of DevOps, but not the other way around. SRE practices are all about improving system performance while adhering to customer expectations and requirements, and about adopting new risk mitigation and disaster recovery strategies while keeping monitoring and measurement the utmost priority.
Chaos Engineering is another important practice that runs hand-in-hand with SRE to achieve the goals set by management.
Enterprises need to be patient and allow the process to deliver value. Site Reliability Engineering is much more than just reliability. It is about bridging the gap between the development and operations teams. It is about analyzing data and taking action accordingly. It is about building a culture, and that culture eventually helps achieve customer success and satisfaction.