The Ultimate Guide to SLI/SLO
This is Hasegawa @rarirureluis from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-functional support for the infrastructure of our media services, improves existing services, launches new ones, and contributes to OSS.
This article shares a portion of an internal document that explains, with concrete materials and templates, the steps an SRE takes when joining a team in order to establish SLIs/SLOs there.
I hope this helps in some way.
Introduction
"The experience of the service we operate is gradually getting worse." Are you managing to prevent this?
As time passes, the number of database records for an internet service may increase, worsening latency, or as the service grows, the number of requests may increase, resulting in high load and a poor user experience.
By implementing SLI/SLO, we can visualize these potential service degradations as quality indicators in quantitative numbers and respond accordingly.

In this article, I summarize the procedures and know-how for introducing SLI/SLO after joining another team as an Embedded SRE.
By reading this, you will get some tips on how to smoothly introduce SLI/SLO when you join another team as an Embedded SRE.
0. Mental Preparation
It may take more than six months from introducing SLI/SLO to the point where the team operates it autonomously, so patience is essential.
Along the way, you will encounter situations where you can enjoy the benefits of SLI/SLO. Once you experience them, you can use those benefits as motivation to maintain them. For example, you can detect potential deterioration in service quality early on and have a common language with team members other than engineers.
1. Gaining credibility
If you join another team and suddenly start talking about SLI/SLO, the other team may not see the value and may think that it will just take up a lot of time and effort.
Therefore, you need a phase in which the development team, PdMs, and non-engineers come to feel that "this person is reliable" and "there is value in joining this person's efforts."
Toil reduction
- Toil refers to the repetitive, tedious, menial work involved in maintaining infrastructure and code.
- Once you join the team, the first thing you should do is understand the infrastructure configuration and deployment flow, and look for any toil you can immediately reduce.
- example:
- Cleaning up unmaintained IaC
- Accelerating the deployment flow
- Reviewing excessive resource allocation
Reducing toil is extremely effective at lightening the load on development team members and creating spare capacity.
It becomes much easier to gain the team's trust once they feel that "the amount of monotonous work has decreased since this person joined."
Cost reduction
- Cost reductions can be demonstrated as results that are easy to understand for non-engineers as well.
- Even relatively simple tasks, such as reviewing over-allocation of resources and shutting down unnecessary resources, can have a big impact.
Regular attendance
- It is important to communicate with your team by actively participating in daily and weekly regular meetings and evening gatherings.
- When you start implementing SLI/SLO in earnest, it's important to build relationships with non-engineers and the engineering team early on so that discussions with them can go smoothly.
2. Organizing the alert infrastructure
It is advisable to get your alerts in order before introducing SLI/SLO.
If alerts are overloaded or distributed across multiple tools, not only will the operational burden increase, but even if an SLO violation is detected, it will be difficult to take appropriate action.
- Organize "Which alerts are the most important?" and "Are there any alerts that are just noise?"
- Review the alert trigger criteria and notification groups to ensure operational costs are not too high
Datadog and Grafana example
- Datadog
- The UI is comprehensive, making it easy to handle log-based custom metrics and latency/availability SLOs all at once.
- Grafana
- Supports a wide range of data sources, including Prometheus and Cloud Monitoring
- Flexible configuration is possible, such as integrating Alertmanager and linking with Terraform for IaC.
Why IaC Alert Management?
- Change history management
- You can track changes to alert rules using Git etc.
- Environmental Consistency
- The same definition can be used in both production and staging environments.
- High reproducibility
- Quickly adaptable to new projects and teams
Managing alerts as code also makes it easier for your team to review them.
3. Introducing SLIs/SLOs
Once you have gained trust, organized the alert infrastructure, and made the team's operations smoother, it is time to move on to the main topic: introducing SLI/SLO.
Here's a quick introduction to key terms and some tips to get you started.
Terminology
CUJ (Critical User Journey)
- A definition of the most important sequence of steps users take on the service.
- For example, on an e-commerce site, the flow would be "Home page → View product page → Add to cart → Complete payment."
- By first defining the CUJ, it becomes clear which indicators (SLIs) should be measured.
SLI (Service Level Indicator)
- It is a quantitative indicator of the quality of a service.
- For example, API response times, error rates, and successful request rates.
- "Which metrics are needed to meet CUJ?HowIt is important for the team to agree on what to measure.
SLO (Service Level Objective)
- This is a "target value" that indicates how much SLI should be achieved.
- For example, "99.9% of requests will respond within 3 seconds."
- This target value will give you an idea of how stable your service is or if there is room for improvement.
Error Budget
- An indicator of the "allowance" for failure that comes from setting the SLO target below 100%.
- Example: "SLO 99.9% → tolerate a total error of 0.1% over 30 days"
- It is important to create team rules such as stopping additional development when the error budget is depleted and redirecting resources to restoring reliability.
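To make that example concrete, here is a small illustrative calculation of my own (the request volume is an assumption) showing how a 99.9% SLO over 30 days translates into allowed bad requests and allowed downtime.

```python
# Error-budget arithmetic for a 99.9% SLO over a 30-day window.
slo_target = 0.999
error_budget = 1 - slo_target  # 0.001, i.e. 0.1% of events may be "bad"

# Request-based view: with 10 million requests in 30 days (assumed volume),
# up to 10,000 of them may fail before the SLO is violated.
total_requests = 10_000_000
print(round(total_requests * error_budget))  # 10000

# Time-based view: 0.1% of 30 days is about 43 minutes of full outage.
minutes_in_30_days = 30 * 24 * 60
print(round(minutes_in_30_days * error_budget, 1))  # 43.2
```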
Burn Rate
- Indicates the rate at which the error budget is being consumed.
- For example, against a 30-day target, a burn rate of 2 means burning through the error budget in 15 days (see the calculation after this list).
- Monitoring burn rates over multiple windows, both short-term (5 minutes) and long-term (1 hour), makes it easier to detect both sudden outages and chronic degradation.
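As a quick sanity check of the "burn rate of 2 exhausts the budget in 15 days" example, here is the underlying arithmetic as a minimal sketch (my own illustration).

```python
# Burn rate = how many times faster than "sustainable" the error budget is
# being consumed. At burn rate 1 the budget lasts exactly the SLO period.
SLO_PERIOD_DAYS = 30

def days_until_budget_exhausted(burn_rate: float) -> float:
    return SLO_PERIOD_DAYS / burn_rate

print(days_until_budget_exhausted(1))     # 30.0
print(days_until_budget_exhausted(2))     # 15.0 -> the example above
print(days_until_budget_exhausted(14.4))  # ~2.1 -> a typical "fast burn" level
```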
Find team members
You can go it alone, but try to find an engineering team member who is open to implementing SLOs.
If no one comes to mind, it may be a good idea to approach someone, leaning on the trust you built in step 1.
When doing it with members
I prefer members who can make up for the things I lack.
In my case, I tend to have a narrow perspective and am not strong at taking the lead, so I looked for someone who could fill those gaps.
It might be even better if the person is well-known within the company.
If you're doing it alone
The first time I introduced SLOs, at Dot Money, I did it alone.
Since I did not yet understand SLOs well enough to teach them to the other members, I quietly worked on it by myself, going as far as deciding, together with the implementers and business managers, what to do when the error budget is depleted.
I think it's also possible to proceed on your own and gradually spread it among your engineers.
Start small
From defining the CUJ through introducing the SLO, there is no need to involve anyone other than the PdM and engineers from the beginning.
It is important to first get a sense of SLI/SLO within the server team.
It would be effective for the server team to become capable of operating the system to a certain extent, and then use the benefits gained from that to expand the system to non-engineers.
Deciding on CUJ
It's best to start with something simple at first.
For example, it's a good idea to start with basic functionality like a login API.
There is no need to decide on a large number of CUJs right away. Three or so is enough at first. You can gradually increase the number.
Deciding on SLI
The best place to get the SLI is from the load balancer logs.
This allows you to get metrics closest to your users.
If it is difficult to obtain LB logs, you can use the metrics output by the backend.
When determining your SLI, it's important to be able to filter out automated requests such as curl or malicious requests.
Otherwise, you may end up with numbers that don't reflect the actual user experience.
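As an illustration of that filtering, here is a sketch of computing an availability SLI from load balancer log records while excluding obvious automated traffic. The record layout and the bot patterns are assumptions for the example, not any particular LB's log format.

```python
import re

# Hypothetical LB log records: (status_code, latency_ms, user_agent)
records = [
    (200, 120, "Mozilla/5.0 (Windows NT 10.0) ..."),
    (200,  80, "curl/8.5.0"),
    (500, 300, "Mozilla/5.0 (iPhone) ..."),
    (200,  90, "Googlebot/2.1"),
]

AUTOMATED = re.compile(r"curl|bot|crawler|spider", re.IGNORECASE)

def availability_sli(rows):
    """Share of non-5xx responses among requests from real users."""
    user_rows = [r for r in rows if not AUTOMATED.search(r[2])]
    if not user_rows:
        return None
    good = sum(1 for status, _, _ in user_rows if status < 500)
    return good / len(user_rows)

print(availability_sli(records))  # 0.5 -> only the two browser requests count
```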
Latency and availability SLIs: separate or combined?
There are pros and cons to setting latency and availability SLIs separately or together.
|  | Separately | Together |
| --- | --- | --- |
| Advantages | Since the two are separate, it is easier to pinpoint causes and analyze each aspect | A single SLO makes the overall service quality easy to grasp |
| Disadvantages | More SLIs to manage | Detailed problem analysis can be difficult |
Personally, I recommend setting them separately.
This is because it makes it easier to identify the root cause of a problem and lead to more detailed improvement activities. Also, because latency and availability have different characteristics, managing them separately makes it easier to take appropriate measures for each.
Determine your SLOs
When deciding on a target SLO, it is best to set a value that is achievable with the current SLI. If the error budget keeps sitting at 100%, you can gradually tighten it.
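One way to find an achievable starting point is to compute what the SLI actually did over a recent window and choose a target just below it. A rough sketch with assumed numbers:

```python
# Hypothetical (good, total) request counts per day for the last 30 days.
daily_counts = [(998_700, 1_000_000)] * 30

good = sum(g for g, _ in daily_counts)
total = sum(t for _, t in daily_counts)
achieved_sli = good / total
print(f"achieved SLI over 30 days: {achieved_sli:.4%}")  # 99.8700%

# Start with the tightest target you can already meet, then tighten it later
# if the error budget keeps sitting at 100%.
candidates = [0.999, 0.9985, 0.998, 0.995, 0.99]
starting_slo = max(t for t in candidates if t <= achieved_sli)
print(f"suggested starting SLO: {starting_slo:.2%}")  # 99.85%
```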
I have a spreadsheet to organize my SLIs/SLOs.
We recommend using this in the early stages to manage SLIs and SLOs internally within your team.
SLI research and organization spreadsheet
Setting alerts and error budget burn rates
Once you have implemented the SLO, you can set up alerts.
We recommend setting alerts on your error budget burn rate.
Cloud Monitoring and Datadog handle burn rate differently.
Datadog
You'll need to calculate the burn rate yourself, but Datadog's official documentation provides a table.
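The calculation itself is the standard one behind those tables (the same idea appears in the Google SRE Workbook): a burn-rate threshold is the multiple of normal consumption that would spend a chosen fraction of the budget within the alert window. A sketch, with example window and budget choices of my own:

```python
def burn_rate_threshold(slo_period_days: float,
                        window_hours: float,
                        budget_fraction: float) -> float:
    """Burn rate that, if sustained for `window_hours`, consumes
    `budget_fraction` of the error budget for the whole SLO period."""
    period_hours = slo_period_days * 24
    return budget_fraction * period_hours / window_hours

# Alert when 2% of a 30-day budget would be consumed within 1 hour.
print(round(burn_rate_threshold(30, 1, 0.02), 1))   # 14.4 (fast burn)
# Alert when 10% of a 30-day budget would be consumed within 6 hours.
print(round(burn_rate_threshold(30, 6, 0.10), 1))   # 12.0 (slower burn)
```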

Grafana + Cloud Monitoring query example

Example queries in Datadog

Fast and Slow Burn Rates
There are two types of burn rates: fast and slow.
Each has the following characteristics:
- fast
- Monitor your burn rate over a short lookback period, such as 5m.
- The main purpose is to detect bugs during release, degradation during infrastructure changes, and degradation of external APIs.
- Specifically, if Graceful Shutdown isn't working properly at release time, a fast burn rate is useful.
- slow
- Monitor your burn rate over a long lookback period, such as 1h.
- The purpose is to detect gradual deterioration (such as query latency increasing with the number of DB records, or N+1 queries in new feature code) rather than sudden drops in service quality.
Fast and slow combined burn rate alert
In the Datadog example, we use a combined alert of 5m (fast) and 1h (slow).
It fires only when both the 5m and 1h thresholds are exceeded, which reduces alert noise, but some of the benefit of immediacy is lost.
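Conceptually, the combined alert is just an AND of the two windows. A minimal sketch of that condition (not an actual Datadog monitor definition; the threshold and burn-rate values are assumed):

```python
def should_alert(burn_rate_5m: float, burn_rate_1h: float,
                 threshold: float = 14.4) -> bool:
    """Fire only when both the fast (5m) and slow (1h) windows exceed the
    threshold, so a brief spike alone does not page anyone."""
    return burn_rate_5m > threshold and burn_rate_1h > threshold

print(should_alert(burn_rate_5m=30.0, burn_rate_1h=3.0))   # False: short spike
print(should_alert(burn_rate_5m=30.0, burn_rate_1h=16.0))  # True: sustained burn
```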
Alert noise: how to deal with frequent alerts during periods of low request volume
Late at night or early in the morning, the total number of requests (the denominator) is small, and the impact of bad requests is greater, so alerts are more likely to be triggered.
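The effect is easy to see with small numbers (my own illustration, not figures from this service):

```python
def error_rate(bad: int, total: int) -> float:
    return bad / total

# The same single failed request looks very different at different volumes.
print(error_rate(1, 20_000))  # 0.00005 -> 0.005%, well within a 99.9% SLO
print(error_rate(1, 20))      # 0.05    -> 5%, a huge apparent burn at 3 a.m.
```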

As a countermeasure, the fast + slow composite burn rate alert mentioned above is effective, but the benefit of immediacy is lost.
So what should we do?
In this case, we adopted the following policy: even if the error budget is depleted, there is no need to respond in the middle of the night, so burn rate alerts are muted outside of business hours.
When the error budget is depleted, we allow for a time lag, which is covered by biweekly SLI/SLO retrospective meetings.
Compromise is also important.
4. Fostering an SLO culture
Here are some steps to foster an SLO culture within your server team:
- Conduct a team-wide training session on the basic concepts and importance of SLI/SLO
- Organize regular SLO review meetings with the whole team
- Establish a response flow for when an SLO is violated and share it with the team
- Celebrate contributions to SLO improvements and share success with the whole team
- Make it a habit to consider the impact on SLOs when introducing new features or changes
Team study session
First, hold study sessions and retrospective meetings so that the ideas take root within the team.
SLI/SLO study materials for engineers
masked
Frequency of retrospective meetings
Holding a retrospective meeting once a week can be too frequent and can lead to exhaustion.
It's important to set the right frequency based on the volume of SLOs and your team's resources.
In this case, SLO review meetings are held once every two weeks.
This provides a balanced frequency that allows you to grasp long-term trends without being overly concerned with short-term fluctuations.
Contents of the retrospective meeting
We recommend covering the following topics during your retrospective:
- Check the current status of each SLO: Check the error budget consumption and burn rate
- Analyzing the cause and considering countermeasures in the event of an SLO violation
- Consider proposing new SLIs/SLOs or adjusting existing SLOs
- Check the progress of measures to improve SLO
- Sharing knowledge and clarifying questions about SLOs within the team
Specifically, we will look at dashboards such as Grafana and focus our discussions on areas where error budgets are being depleted or where burn rates are high.
We also check the progress of improvements to SLOs that were problematic in the previous review. Through these discussions, the entire team can understand the importance of SLOs and promote continuous improvement activities to improve the quality of services.
Retrospective meeting template
Here we introduce a retrospective meeting template using Wrike.

The important points are as follows:
- SLI/SLO name (or error budget name)
- category
- Latency or Availability
- Whether a follow-up task exists
- There is no need to create a task when there are external factors that cannot be controlled.
- Error Budget
- Update the values at the time of the retrospective meeting.
category
It's not required, but defining it from the beginning may make things easier when the number increases.
In the case of this service, there are 120 SLOs, and they are divided into categories to be reviewed by the respective teams.
Penetration beyond PdMs and engineers
The goal of spreading this to people other than PdMs and engineers is to establish SLI/SLO as a common language across the service and to use technical metrics to inform business decisions.
This allows us to simultaneously improve the quality of our services and achieve our business goals.
Furthermore, if this can be spread to people other than PdMs and engineers, it may also lead to increased motivation for the server team as they will be praised by people other than engineers.
Materials for persuading non-engineers
masked
Definition of Penetration
It means that people other than PdMs and engineers can bring up SLI/SLO on their own initiative and discuss it with the server team.
Specifically, it means that people other than PdMs and engineers come to understand the SLI/SLO figures and use them to propose and prioritize service improvements.
Clearly define the goal
Visualize what state needs to be reached for the effort to catch on and spread among people other than PdMs and engineers.
In this example, there was a business team member who was on good terms with the server team, so we aimed to get the idea across to this person and have it spread from there.

Key phrases to use when expanding to people other than PdMs and engineers
- The introduction of SLI/SLO is expected to improve service quality and customer satisfaction.
- Research from Amazon and Google has shown that improving latency by just 0.1 seconds can have a significant impact on sales.
- Utilizing SLI/SLO not only provides a basis for making decisions when responding to failures, but in the long term it also improves service quality, reducing the frequency of failures and reducing the time constraints of people other than PdMs and engineers.
- Let's work together to improve service reliability and drive business results
I have attached the materials I actually used when explaining the project to people other than PdMs and engineers.
masked
5. Adjusting SLIs/SLOs
As quality improvements are made, it is not necessarily healthy for the error budget to sit at 100%, so we adjust the SLI/SLO so that the error budget ends up just about used up.
This allows us to balance improving service quality with technological challenges.
How to do a full SLO retrospective
The set SLOs are reviewed every three or six months.
The purpose of this is to:
- Correct for the slack that builds up in the error budget as quality improves
- Correct the gap in the error budget when the SLO that was set turns out to be too lenient
You can obtain a list of SLOs that need to be corrected by using a tool called vigil, which will be described later.
Make a list of what needs to be corrected
The vigil script can output the SLOs that need to be corrected in an Excel file.
The SLOs extracted by vigil are:
- A generous SLO where the error budget has never fallen below n% during the specified period
- An SLO whose error budget is negative for 50% or more of the period
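To illustrate those two criteria, here is a sketch of the selection logic (this is not vigil's actual code; the SLO names, sampled budget values, and the 90% "generous" floor are assumptions):

```python
# Remaining error budget (as a fraction of the full budget) sampled over the
# review period, per SLO. Positive = budget left, negative = overspent.
slo_budget_history = {
    "checkout-availability": [0.95, 0.97, 0.96, 0.98],     # never dips: too lenient
    "search-latency":        [0.10, -0.20, -0.05, -0.30],  # often negative: too strict
    "login-availability":    [0.60, 0.30, 0.10, 0.40],     # healthy: leave as is
}

def needs_correction(history, generous_floor=0.90):
    never_below_floor = all(b >= generous_floor for b in history)
    negative_half_the_time = sum(1 for b in history if b < 0) >= len(history) / 2
    return never_below_floor or negative_half_the_time

for name, history in slo_budget_history.items():
    print(name, needs_correction(history))
# checkout-availability True, search-latency True, login-availability False
```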
Usage example

The output Excel file is transcribed into a spreadsheet or similar and discussions are held.
Things to check first
Please check that your query is correct.
For example, check whether each Good/Total excludes access by bots and whether unintended requests are excluded.
Adjust SLOs or SLIs?
For latency, do not adjust the SLO itself.
Once you are satisfied that your queries are correct, availability SLOs can be adjusted directly.
For latency SLOs, adjust the SLI instead.
SLI Tuning Guide

What you need to adjust SLI
- Existing threshold
- SLO
- SLI (latency, availability status code, etc.)
- A graph of the corresponding SLI
SLI adjustment examples by pattern
External APIs increase latency

Overview
There was a temporary degradation in the API that the service relies on, which worsened the service's SLI and depleted the error budget, but it has since stabilized.
No action is required.
Unknown pattern

Overview
The SLI threshold set when the SLO was added was too strict, so the SLI consistently falls short of the SLO target.
This often happens when you add a new SLO.
In this case, you should recheck the SLI.
1. Check if the SLI is correct
Check your current SLI.

In this example, check not only the 1600 ms upper threshold but also whether the conditions used for the SLI are correct.
For example, for a latency SLO you might intend for a 2xx response within 0 to 1600 ms to count as good, but the status condition expression may be written incorrectly and also include 5xx responses.
For latency SLOs, the SLO itself is fixed at 99.9% and is not changed; adjust the latency side of the SLI instead.
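As a sketch of the kind of "good event" condition to double-check (the 1600 ms threshold is the one from this example; the status handling is the part that is easy to get wrong):

```python
def is_good_latency_event(status_code: int, latency_ms: float,
                          threshold_ms: float = 1600) -> bool:
    """Good event for the latency SLI: a successful (2xx) response that
    completed within the threshold. If the status condition mistakenly lets
    5xx through, slow errors leak into the latency SLI and make it look worse
    than the real user experience."""
    return 200 <= status_code < 300 and 0 <= latency_ms <= threshold_ms

print(is_good_latency_event(200, 1200))  # True
print(is_good_latency_event(500, 300))   # False: belongs to availability, not latency
print(is_good_latency_event(200, 2100))  # False: a genuinely slow success
```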
2. Check the latency of the endpoint
Use Grafana or similar tools to observe latency from the endpoint defined in the SLI conditional expression.

By default, the Group by function is set to mean, so we will change it to 99th percentile.
The SLI treats response times of up to 1600 ms as good, but in reality the daytime p99 regularly exceeds 1600 ms.
Therefore, the threshold should be raised above 1600 ms.
Since it is difficult to find the optimal value from the beginning, you can change it to 1800ms, and if the error budget is always at 100%, you can tighten it to 1700ms.
Making change history clearer
If an SLO moves significantly and you don't know why, it may be because someone adjusted the SLI/SLO.
The change history needs to be easy to follow.
Using GitHub Actions makes it easier to track change history by automatically tagging commit messages when SLI/SLO changes are made.
This is one of the reasons why alerts and SLIs/SLOs are incorporated into IaC.
Actions yaml
Be careful with latency adjustments
For latency, we recommend adjusting the SLI response time.
You could also adjust the target SLO instead, but then some people would adjust the SLI response time while others would adjust the target SLO.
It is a good idea to decide in advance that you will "adjust based on response time" to avoid such variations in work.
This helps ensure consistency within the team and makes managing SLIs/SLOs more efficient.
Flow from start to finish

Tips
What do Prometheus percentiles mean?
Looking at the code, the buckets are as follows:
25, 50, 100, 200, 400, 800, 1600, 3200, 6400
Here is an example of 100 requests distributed across buckets.
This distribution can be depicted graphically as follows:
To calculate p99, first accumulate the number of requests for each bucket.
99% of 100 requests equals 99 requests. The first bucket where the cumulative number of requests is 99 or greater is bucket 7 (800ms - 1600ms), so p99 is 1600ms.
This means that 99% of the 100 requests were processed in 1600 ms or less.
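Here is that calculation as a small sketch (the per-bucket counts are made-up numbers, since the original distribution was shown as a figure):

```python
# Prometheus-style histogram: bucket upper bounds (`le`) and an assumed
# distribution of 100 requests across them.
bounds =     [25, 50, 100, 200, 400, 800, 1600, 3200, 6400]
per_bucket = [ 5, 10,  20,  25,  20,  15,    4,    1,    0]  # sums to 100

def percentile_bucket_bound(bounds, per_bucket, q_percent=99):
    """Upper bound of the first bucket whose cumulative count reaches
    q_percent of all requests (the simplified reading above). Prometheus's
    histogram_quantile() additionally interpolates linearly inside that
    bucket, so it would report a value between 800 and 1600 ms here rather
    than exactly 1600 ms."""
    total = sum(per_bucket)
    cumulative = 0
    for bound, count in zip(bounds, per_bucket):
        cumulative += count
        if cumulative * 100 >= q_percent * total:  # integer comparison, no float error
            return bound
    return float("inf")

print(percentile_bucket_bound(bounds, per_bucket))  # 1600
```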
Conclusion
I hope this will be of some help in introducing SLI/SLO.