The Definitive Guide to SLI/SLO
This is Hasegawa @rarirureluis from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and more.
This article introduces, with concrete materials and templates, part of our internal documentation that explains the steps an SRE takes when joining a team to implement SLIs/SLOs there.
I hope this helps in some way.
Introduction
Are you managing to prevent the experience of the services you operate from gradually deteriorating?
Over time, an Internet service may see latency worsen as database records accumulate, or, as the service grows and request volume increases, suffer high load and a degraded user experience.
By using SLI/SLO, we can visualize these potential service degradations as quality metrics and respond accordingly.

In this article, I summarize the procedures and know-how I used to join another team as an Embedded SRE and introduce SLIs/SLOs there.
By reading this, you will get some tips on how to smoothly introduce SLI/SLO when you join another team as an Embedded SRE.
0. Mindset
It may take more than six months from the introduction of SLIs/SLOs to the point where the team is autonomously managing them, so patience is essential.
Along the way, you will encounter situations where you can enjoy the benefits of SLI/SLO. Once you experience them, you can use those benefits as motivation to keep using them. For example, you can detect potential degradation in service quality early and share a common language with team members other than engineers.
1. Gaining credibility
If you join another team and suddenly start talking about SLI/SLO, the other team may not see the value and may think that it will just take up a lot of time and effort.
Therefore, you need a phase in which the development team, PdMs, and non-engineers all come to feel that "this person is reliable" and "there is benefit in joining this person's efforts."
Toil reduction
- Toil refers to the repetitive, tedious, manual work involved in maintaining infrastructure and code.
- When you join the team, first understand the infrastructure configuration and deployment flow, and look for toil that can be reduced right away.
- Examples:
- Cleaning up unmaintained IaC
- Accelerating deployment flows
- Reviewing excessive resource allocations
Reducing toil is extremely effective at lightening the development team's load and creating spare capacity.
It becomes easier to gain the team's trust once they feel that "the amount of monotonous work has gone down since this person joined."
Cost reduction
- Cost reductions can be demonstrated as results that are easy to understand for non-engineers as well.
- Even relatively simple tasks like reviewing over-allocation of resources and shutting down unnecessary resources can have a big impact.
Regular Attendance
- It is important to communicate with your team by actively participating in daily and weekly regular meetings and evening gatherings.
- When you start to seriously implement SLI/SLO, build relationships with non-engineers and the engineering team early on so that discussions with them can go smoothly.
2. Organizing the alert infrastructure
It is advisable to get your alerts in order before introducing SLIs/SLOs.
If there are too many alerts or they are distributed across multiple tools, not only will the operational burden increase, but you will also be unable to take appropriate action even if an SLO violation is detected.
- Organize "Which alerts are the most important?" and "Are there any that are just noise?"
- Review the alert trigger criteria and notification groups to ensure operational costs are not too high
Datadog and Grafana Example
- Datadog
- The UI is comprehensive, making it easy to handle log-based custom metrics and latency/availability SLOs all at once.
- Grafana
- Supports a wide range of data sources including Prometheus and Cloud Monitoring
- Flexible configuration is possible, such as integrating Alertmanager and linking with Terraform for IaC.
Why IaC Alert Management?
- Change history management
- You can track changes to alert rules using Git, etc.
- Environmental Consistency
- The same definition can be used in both production and staging environments.
- High reproducibility
- Can be quickly adapted to new projects and teams
Managing alerts as code also makes them easier to review with your team.
3. Introducing SLIs/SLOs
Now that we have gained trust, organized the alert infrastructure, and gotten things running smoothly, it's time for the main event: introducing SLIs/SLOs.
Here's a quick introduction to key terminology and some tips to get you started.
Explanation of terms
CUJ (Critical User Journey)
- The definition of the most important sequence of steps a user takes on the service.
- For example, on an EC site, the sequence of steps would be "Top page → View product page → Add to cart → Complete payment."
- By first determining the CUJ, it becomes clear which indicators should be measured (SLIs).
SLI (Service Level Indicator)
- It is a quantitative indicator of the quality of a service.
- For example, API response times, error rates, successful request rates, etc.
- "Which metrics are needed to meet CUJ?HowIt is important for the team to agree on what to measure.
SLO (Service Level Objective)
- This is a "target value" that indicates how close the SLI should be achieved.
- For example, "99.9% of requests will respond within 3 seconds."
- This target value gives you an idea of how stable your service is or if there is room for improvement.
Error Budget
- This can be thought of as the "allowance for failure" that comes from setting the SLO target below 100%.
- Example: "SLO 99.9% → tolerate a total of 0.1% errors over 30 days" (a small worked example follows this list).
- It is important to create team rules, such as stopping additional development when the error budget is depleted and redirecting resources to restoring reliability.
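To make the example above concrete, here is a minimal sketch (my own illustration, with a hypothetical request volume) of how a 99.9% / 30-day target translates into an error budget:

```python
# Minimal sketch: turning the SLO example above into a concrete error budget.
# The 99.9% / 30-day figures mirror the example; the request volume is hypothetical.

slo_target = 0.999            # 99.9% SLO
window_days = 30
total_requests = 10_000_000   # hypothetical traffic over the 30-day window

budget_fraction = 1 - slo_target                        # 0.1% of requests may fail
allowed_bad_requests = total_requests * budget_fraction
allowed_downtime_min = window_days * 24 * 60 * budget_fraction

print(f"error budget: {budget_fraction:.2%} of requests")
print(f"= {allowed_bad_requests:,.0f} bad requests out of {total_requests:,}")
print(f"= roughly {allowed_downtime_min:.1f} minutes of full downtime in {window_days} days")
```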
Burn Rate
- Indicates how quickly the error budget is being consumed.
- For example, with a 30-day window, a burn rate of 2 means "a pace that exhausts the error budget in 15 days" (see the sketch after this list).
- Monitoring burn rate over multiple windows, both short-term (5 minutes) and long-term (1 hour), makes it easier to detect both sudden outages and chronic degradation.
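Here is the sketch referenced above: a minimal illustration with a hypothetical error rate showing how burn rate relates to time-to-exhaustion.

```python
# Minimal sketch: burn rate is how many times faster than the "budget-neutral"
# pace the error budget is being consumed.

slo_target = 0.999            # 99.9% SLO over a 30-day window
window_days = 30

budget_fraction = 1 - slo_target       # 0.1% error budget
observed_error_rate = 0.002            # hypothetical: 0.2% of requests currently failing

burn_rate = observed_error_rate / budget_fraction    # -> 2.0
days_to_exhaustion = window_days / burn_rate         # -> 15 days, as in the example above

print(f"burn rate = {burn_rate:.1f}")
print(f"at this pace the {window_days}-day budget is exhausted in {days_to_exhaustion:.0f} days")
```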
Find team members
You can go it alone, but try to find a team of engineers who are open to implementing SLOs.
If no one comes to mind, it might be a good idea to approach people based on the trust you built in step 1.
When working with members
Look for members who can cover the gaps in your own skills.
In my case, I tend toward a narrow perspective and am not strong at leadership, so I looked for someone who could make up for those gaps.
It might be even better if the person is well-known within the company.
If you're doing it alone
The first time I introduced SLOs, at dotmoney, I did it alone.
Since I did not yet understand SLOs well enough to teach them to other members, I quietly worked on it by myself, going as far as agreeing with the engineering and business leads on what to do when the error budget is depleted.
I think it's also possible to work on it alone and gradually spread it among your engineers.
Start small
From defining the CUJ through introducing SLOs, there is no need to involve anyone beyond the PdM and engineers at first.
It is important to first get a sense of SLI/SLO within the server team.
It would be effective to first get the server team up to a certain level of operation, and then use the benefits gained to expand to non-engineers.
Deciding on CUJ
It's best to start with something simple at first.
For example, we recommend starting with basic functionality like a login API.
You don't need to decide on a lot of CUJs right away. Three is enough at first. You can increase the number gradually.
Deciding on SLI
The best place to get the SLI is through the load balancer logs.
This allows you to get metrics closest to your users.
If it is difficult to obtain LB logs, you can use the metrics output by the backend.
When determining your SLI, it's important to be able to filter out automated requests such as curl or malicious requests.
Otherwise, you may end up with numbers that don't reflect the actual user experience.
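As one illustration of this kind of filtering, here is a rough sketch; the log fields and user-agent patterns are hypothetical and not tied to any particular load balancer's log format:

```python
import re

# Rough sketch only: hypothetical, simplified LB log records.
AUTOMATED_UA = re.compile(r"curl|bot|crawler|spider|monitor", re.IGNORECASE)

def is_user_request(record: dict) -> bool:
    """Keep only requests that plausibly reflect real user experience."""
    if AUTOMATED_UA.search(record.get("user_agent", "")):
        return False                          # curl, crawlers, synthetic monitors, etc.
    if record.get("path", "").startswith("/healthz"):
        return False                          # internal health checks
    return True

logs = [
    {"path": "/api/login", "status": 200, "user_agent": "Mozilla/5.0"},
    {"path": "/api/login", "status": 500, "user_agent": "curl/8.5.0"},
    {"path": "/healthz",   "status": 200, "user_agent": "kube-probe/1.29"},
]

user_logs = [r for r in logs if is_user_request(r)]
good = sum(1 for r in user_logs if r["status"] < 500)
print(f"availability SLI after filtering: {good}/{len(user_logs)} good requests")
```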
Separate or joint latency and availability SLIs?
There are pros and cons to setting latency and availability SLIs separately or together.
| | Separate | Combined |
|---|---|---|
| Merits | Causes are easier to identify and analyze because the signals are separate | A single SLO makes overall service quality easy to grasp |
| Disadvantages | More SLIs to manage | Detailed problem analysis may be difficult |
Personally, I recommend setting them separately.
This is because it is easier to identify the root cause of a problem and lead to more detailed improvement activities. Also, because latency and availability have different characteristics, managing them separately makes it easier to take appropriate measures for each.
Decide on your SLOs
When deciding on your Target SLO, it is a good idea to set it to a value that is achievable with your current SLI. If your error budget continues to be at 100%, you can gradually adjust it.
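For example, you can work backwards from what your current SLI has actually been delivering; here is a rough sketch with made-up numbers:

```python
# Rough sketch with hypothetical numbers: measure what the SLI already achieves,
# then pick the tightest "standard" target that is still below it.

good_requests = 9_986_000
total_requests = 10_000_000

current_sli = good_requests / total_requests        # 0.9986 -> 99.86%
candidate_targets = [0.999, 0.998, 0.995, 0.99]
starting_slo = max(t for t in candidate_targets if t <= current_sli)

print(f"current SLI over the window: {current_sli:.2%}")   # 99.86%
print(f"suggested starting SLO:      {starting_slo:.2%}")  # 99.80%
```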
I have a spreadsheet to help me organize my SLIs/SLOs.
We encourage you to use this in the early stages to manage your SLIs and SLOs internally.
SLI Research and Organization Spreadsheet
Setting alerts and error budget burn rates
Once you have implemented the SLO, you can set up alerts.
We recommend setting alerts on your error budget burn rate.
Cloud Monitoring and Datadog handle burn rate differently.
Datadog
You'll need to calculate the burn rate yourself, but Datadog's official documentation provides a table.
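The arithmetic behind such a table is straightforward. The sketch below uses the budget-consumption/window pairs commonly cited in the Google SRE Workbook (not values taken from this article): the threshold is simply the burn rate at which a given share of a 30-day budget would be consumed within the alert window.

```python
# Sketch of the arithmetic behind a burn-rate threshold table for a 30-day SLO window.

SLO_WINDOW_HOURS = 30 * 24   # 30-day SLO period

def burn_rate_threshold(budget_fraction_spent: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction_spent` of the budget disappears in `window_hours`."""
    return budget_fraction_spent * SLO_WINDOW_HOURS / window_hours

print(burn_rate_threshold(0.02, 1))    # 14.4 -> 2% of the budget gone in 1 hour (fast)
print(burn_rate_threshold(0.05, 6))    # 6.0  -> 5% of the budget gone in 6 hours
print(burn_rate_threshold(0.10, 72))   # 1.0  -> 10% of the budget gone in 3 days (slow)

# In a multi-window alert, the short window (e.g. 5m) usually reuses the long
# window's threshold and only acts as a "still happening right now" check.
```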

Grafana + Cloud Monitoring Query Example

Example queries in Datadog

Fast and Slow Burn Rates
There are two types of burn rates: fast and slow.
Each has the following characteristics.
- fast
- Monitor your burn rate over a short lookback period, such as 5m.
- The main purpose is to detect bugs during release, degradation during infrastructure changes, and degradation of external APIs.
- For example, a fast burn-rate alert is effective at catching cases where graceful shutdown is not working properly at release time.
- slow
- Monitor your burn rate over a long lookback period, such as 1h.
- The main purpose is to detect gradual degradation (such as query latency creeping up as DB records grow, or an N+1 query in newly released feature code) rather than the sudden quality drops that the fast burn rate catches.
Fast and slow combined burn rate alert
In the Datadog example, we use a combined alert of 5m (fast) and 1h (slow).
It will fire when both the 5m and 1h thresholds are reached, which helps to suppress alert noise.
But the benefit of immediacy is lost.
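Conceptually, the evaluation logic looks like the following sketch (pseudologic only, not how Datadog implements it internally): the alert fires only when both windows exceed the threshold, which filters out short blips while the long window tracks real budget consumption.

```python
# Pseudologic sketch of a multi-window burn-rate alert (not Datadog's implementation):
# the 1h window confirms meaningful budget consumption, the 5m window confirms
# the problem is still happening right now.

def should_alert(burn_rate_5m: float, burn_rate_1h: float, threshold: float) -> bool:
    return burn_rate_1h > threshold and burn_rate_5m > threshold

print(should_alert(burn_rate_5m=20.0, burn_rate_1h=16.0, threshold=14.4))  # True: fire
print(should_alert(burn_rate_5m=3.0,  burn_rate_1h=16.0, threshold=14.4))  # False: already recovering
```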
Alert noise: how to deal with frequent alerts during periods of low traffic
Late at night and early in the morning, the total number of requests (the denominator) is small, and the impact of bad requests is greater, so alerts are more likely to be triggered.

Against this problem, the fast + slow composite burn-rate alert mentioned above is effective, but it loses immediacy.
So what should we do?
In our case, we adopted the following policy: even if the error budget is depleted, there is no need to respond in the middle of the night, so burn-rate alerts are muted outside business hours.
We accept some lag in noticing that the error budget has been exhausted and pick it up in the biweekly SLI/SLO retrospectives.
Compromise is also important.
4. Cultivating an SLO culture
Here are some steps to foster an SLO culture within your server team:
- Conduct a team-wide training session on the basic concepts and importance of SLI/SLO
- Schedule regular SLO reviews and have the whole team participate
- Establish a procedure for responding to SLO violations and share it with your team.
- Celebrate contributions to SLO improvements and share their success with the whole team
- Make it a habit to consider the impact on SLOs when introducing new features or changes.
Team study sessions
First, we will hold study sessions and review sessions to ensure that the idea is absorbed within the team.
SLI/SLO study materials for engineers
masked
Frequency of retrospective meetings
Holding retrospective meetings once a week can be too frequent and can lead to exhaustion.
It's important to set the right frequency based on the volume of SLOs and your team's resources.
In this case, SLO review meetings are held once every two weeks.
This provides a well-balanced frequency that allows you to grasp long-term trends without being overly concerned with short-term fluctuations.
Contents of the retrospective meeting
We recommend covering the following during your retrospective:
- Check the current status of each SLO: Check the error budget consumption and burn rate
- Analyze the cause and consider countermeasures in case of SLO violation
- Consider proposing new SLIs/SLOs or adjusting existing SLOs
- Check the progress of measures to improve SLO
- Share knowledge and clarify questions about SLOs within your team
Specifically, we will look at dashboards such as Grafana and focus on discussing issues where error budgets are being depleted or where burn rates are high.
We also check the progress of improvements to SLOs that were problematic in the previous review. Through these discussions, the importance of SLOs can be understood by the entire team, and continuous improvement activities can be promoted to improve the quality of services.
Retrospective template
Here we introduce a retrospective template using Wrike.

The important points are as follows:
- SLI/SLO name (or error budget name)
- category
- Latency or Availability
- Can it be made into a task?
- There is no need to create a task when there are external factors that cannot be controlled.
- Error Budget
- We will update the values at the retrospective meeting.
Category
It's not required, but defining it from the beginning may make things easier when the number increases.
In the case of this service, there are 120 SLOs, and reviews are conducted by dividing into teams based on category.
Spreading to people other than PdMs and engineers
The goal of spreading SLI/SLO to people other than PdMs and engineers is to establish it as a common language across the service and to let technical metrics inform business decisions.
This allows you to simultaneously improve the quality of your services and achieve your business goals.
Furthermore, if the idea spreads beyond PdMs and engineers, the server team may also become more motivated by being recognized by non-engineers.
Materials for persuading non-engineers
masked
Definition of "spreading"
It means that people other than PdMs and engineers can bring up SLI/SLO on their own initiative and discuss it with the server team.
Specifically, this means that people other than PdMs and engineers will be able to understand the SLI/SLO numbers and make suggestions for improving and prioritizing services based on them.
Clarify your goal
Visualize what state you need to reach for people other than PdMs and engineers to take an interest and adopt the practice.
In this case, a member of the business team was on good terms with the server team, so we aimed to get the concept established with that person first and then let it spread from there.

Key phrases to use when spreading to people other than PdMs and engineers
- The introduction of SLI/SLO is expected to improve service quality and customer satisfaction.
- Studies by Amazon and Google have shown that improving latency by just 0.1 seconds can have a significant impact on sales.
- Utilizing SLI/SLO not only provides a basis for decision-making when responding to failures, but in the long term it also improves service quality, reducing the frequency of failures and reducing the time constraints of people other than PdMs and engineers.
- Let's work together to improve service reliability and drive business results
I have attached the materials I used when actually explaining the project to people other than PdMs and engineers.
masked
5. Adjusting SLIs/SLOs
As quality improvements take hold, it is not healthy for the error budget to simply sit at 100%, so we adjust our SLIs/SLOs so that the budget ends up being consumed down to roughly zero over the period.
This allows us to balance improving service quality with technological challenges.
How to do a full SLO retrospective
The set SLOs will be reviewed every 3 or 6 months.
The purpose of this is to:
- As quality improves, excess headroom builds up in the error budget, so we correct for that.
- The SLO that was set is too lenient, so the resulting gap in the error budget needs to be corrected.
You can get a list of SLOs that need to be corrected by using a tool called vigil, which will be described later.
Make a list of what needs to be corrected
The vigil script can output the SLOs that need to be corrected in an Excel file.
The SLOs extracted by vigil are:
- An overly generous SLO whose error budget has never dropped below n% during the specified period
- An SLO whose error budget has been negative for 50% of the specified period
Usage example

The output Excel file is then copied into a spreadsheet or similar and discussions are held.
Things to check first
Please check that your query is correct.
For example, check whether each Good/Total excludes access by bots and whether unintended requests are excluded.
Adjust the SLO or adjust the SLI?
For latency, do not adjust the SLO target itself.
Once you are satisfied that your queries are correct, availability SLOs can be adjusted directly.
For latency SLOs, adjust the SLI (the latency threshold) instead.
SLI Tuning Guide

What you need to adjust SLI
- Existing threshold
- SLO
- SLI (the latency threshold, or the status-code conditions in the case of availability, etc.)
- A graph of the corresponding SLI
Example of SLI adjustment by pattern
External APIs increase latency

Overview
A temporary degradation occurred in the API that the service relies on externally. This caused the service's SLI to worsen and the error budget to be depleted, but it has since stabilized.
No action is required.
Unknown Pattern

Overview
The SLI threshold set when the SLO was added is too strict, and the SLO is consistently being missed.
This often happens when you add a new SLO.
In this case you should recheck the SLI.
1. Check if the SLI is correct
Check your current SLI.

In this example, check not only the 1600 ms upper threshold but also that the conditions used to build the SLI are correct.
For example, for a latency SLI you expect a 2xx response completed within 0-1600 ms to count as good, but the status-code condition might be wrong and also include 5xx responses.
For latency, the SLO is fixed at 99.9% and is not changed, so adjust the latency side of the SLI instead.
2. Check the latency of the endpoint
Use Grafana or similar tools to observe latency from the endpoint defined in the SLI condition expression.

By default, the Group by function is set to mean, so we will change it to 99th percentile.
The SLI treats response times up to 1600 ms as good, but in reality the daytime p99 exceeds 1600 ms.
Therefore, the threshold should be raised above 1600 ms.
Since it is hard to find the optimal value from the start, you might change it to 1800 ms and, if the error budget then stays pinned at 100%, tighten it to 1700 ms.
Making change history clearer
If your SLOs move significantly and you don't know why, it's possible that you adjusted your SLI/SLOs.
The change history needs to be clear so that it is easy to follow.
You can use GitHub Actions to make it easier to track changes by automatically tagging commit messages when SLI/SLO changes are made.
This is one of the reasons why we bring alerts and SLIs/SLOs into IaC.
Actions yaml
Be careful with latency adjustments
For latency, we recommend adjusting the SLI response time.
You could also adjust the Target SLO instead, but then some people would adjust the SLI response time while others adjust the Target SLO.
It is a good idea to decide in advance that you will "adjust based on response time" to avoid this kind of variation in work.
This helps ensure consistency across teams and makes managing SLIs/SLOs more efficient.
Flow from start to finish

Tips
What do Prometheus percentiles mean?
Looking at the application code, the histogram buckets are defined as follows:
25, 50, 100, 200, 400, 800, 1600, 3200, 6400
Here is an example of 100 requests distributed across buckets.
This distribution can be depicted graphically as follows:
To calculate p99, first accumulate the number of requests in each bucket.
99% of 100 requests equates to 99 requests. The first bucket where the cumulative requests are greater than or equal to 99 is bucket 7 (800ms - 1600ms), so p99 is 1600ms.
This means that out of 100 requests, 99% of requests were processed in 1600ms or less.
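Here is that calculation as a small sketch. The per-bucket counts are made up so that the 99th request lands in the 800-1600 ms bucket, matching the walkthrough; note that Prometheus's histogram_quantile additionally interpolates linearly inside the bucket, whereas this simplified version just reports the bucket's upper bound.

```python
# Simplified p99 over histogram buckets, following the walkthrough above.
# The per-bucket request counts are hypothetical (they sum to 100, with the
# 99th request landing in the 800-1600ms bucket).

bucket_upper_bounds_ms = [25, 50, 100, 200, 400, 800, 1600, 3200, 6400]
requests_per_bucket    = [10, 15, 20, 20, 15, 10, 9, 1, 0]   # 100 requests total

def simple_percentile(q: float) -> int:
    target = q * sum(requests_per_bucket)        # 0.99 * 100 = 99 requests
    cumulative = 0
    for upper, count in zip(bucket_upper_bounds_ms, requests_per_bucket):
        cumulative += count
        if cumulative >= target:
            return upper                          # upper bound of the first bucket reaching the target
    return bucket_upper_bounds_ms[-1]

print(simple_percentile(0.99))   # 1600 -> "99% of requests finished within 1600ms"
```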
Conclusion
I hope this article will be of some help in introducing SLI/SLO.