The Complete Guide to SLI/SLO
This is Hasegawa (@rarirureluis) from the Service Reliability Group (SRG) of the Media Division.
#SRG The Service Reliability Group primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article presents a somewhat limited version of an internal document that explains the steps involved in implementing SLIs/SLOs within a team when joining as an SRE, using specific materials and templates.
I hope this is of some help.
Introduction
How do you all prevent the user experience of the services you operate from gradually deteriorating?
Over time, internet services may experience increased latency due to a growing number of database records, or high load due to increased requests as the service grows, potentially degrading the user experience.
By implementing SLI/SLO, these potential service degradations can be visualized as quantitative quality metrics and addressed accordingly.

In this article, I summarize the procedures and know-how I gained by joining another team as an Embedded SRE and actually implementing SLI/SLO there.
Reading this will give you tips on how to smoothly implement SLI/SLO when you join another team as an Embedded SRE.
0. Mindset
It can take more than six months from the initial implementation of SLI/SLOs to the point where the team is autonomously managing them. Patience is essential.
Along the way, you'll encounter situations where you can reap the benefits of SLI/SLO. Once you experience this, you can use that benefit as motivation to maintain it. For example, you can detect potential service quality deterioration early on, or you can establish a common language with team members other than engineers.
1. Gaining trust
If you join a different team and suddenly start talking about SLI/SLO, the other party may not see its value and might think it's just "a lot of work."
Therefore, you first need a phase in which the development team, including product managers and engineers, comes to feel that "this person is reliable" and "working with this person is beneficial."
Toil reduction
- Toil refers to repetitive, tedious tasks, as well as simple tasks associated with infrastructure and code maintenance.
- Once you join the team, first familiarize yourself with the infrastructure configuration and deployment flow, and look for any toil that can be reduced immediately.
- Examples:
- Cleaning up unmaintained IaC
- Accelerating the deployment flow
- Review of excessive resource allocation
Reducing toil is very effective in lightening the burden on development team members and freeing up capacity for them.
By making team members feel that your presence has reduced monotonous tasks, you can more easily gain their trust.
Cost reduction
- Cost reduction can be demonstrated as a tangible result that is easily understood by people other than engineers.
- Even relatively simple tasks, such as reviewing resource allocations or shutting down unnecessary resources, can have a significant impact.
Regular attendance
- It is important to communicate with the team by actively participating in daily and weekly regular meetings and evening gatherings.
- When you're ready to seriously begin implementing SLI/SLO, it's important to build relationships with non-engineers and the engineering team from an early stage to ensure smooth discussions.
2. Streamlining the alert infrastructure
It is advisable to organize your alerts before implementing SLI/SLO.
When alerts are numerous or scattered across multiple tools, it not only increases the operational burden but also prevents appropriate action from being taken even when SLO violations are detected.
- Organize your alerts to determine "Which alerts are the most important?" and "Are there any alerts that are just noise?"
- To keep operational costs from becoming too high, review the alert triggering criteria and the groups to which notifications are sent.
Examples using Datadog and Grafana
- Datadog
- The UI is comprehensive, making it easy to manage log-based custom metrics and SLOs (Service Level Objectives) for latency and availability all in one place.
- Grafana
- Supports a wide range of data sources, including Prometheus and Cloud Monitoring.
- Flexible configurations are possible, such as integrating with Alertmanager or using Terraform to manage alerts as Infrastructure as Code (IaC).
Reasons for using IaC for alert management
- Change history management
- You can track changes to alert rules using tools like Git.
- Environmental consistency
- The same definition can be reused in both the production and staging environments.
- High reproducibility
- Quickly adaptable to new projects and new teams.
Managing alerts with code makes team reviews easier.
3. Implementation of SLI/SLO
Once trust has been established, the alert infrastructure has been streamlined, and team operations are running smoothly, it's finally time to tackle the main task: implementing SLI/SLO.
Here, we'll briefly introduce some key terms and some tips for getting started.
Explanation of terms
CUJ (Critical User Journey)
- A CUJ defines the most important flow of operations that users perform on the service.
- For example, on an e-commerce site, the flow would be: "Homepage → View product page → Add to cart → Complete payment."
- By first defining the CUJ, we clarify which metrics (SLIs) should be measured.
SLI (Service Level Indicator)
- This is an indicator that quantifies the quality of a service.
- Examples include API response time, error rate, and successful request rate.
- It's important for the team to agree on which metrics are needed to satisfy the CUJ and how to measure them.
SLO (Service Level Objective)
- This is a "target value" that indicates the degree to which SLIs should be achieved.
- For example, "99.9% of requests receive a response within 3 seconds."
- This target value helps us understand how stable the service is, or whether there is room for improvement.
Error Budget
- This is the "acceptable amount" of unreliability that results from setting the SLO target below 100%.
- Example: "SLO 99.9% → Allows a total error rate of 0.1% over 30 days."
- It's important to establish team rules, such as stopping additional development and redirecting resources to restoring reliability once the error budget is depleted.
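To make the arithmetic concrete, here is a minimal Python sketch that converts an SLO target into an error budget. The function names and the request volume are assumptions made for this illustration; they do not come from any particular monitoring tool.

```python
def error_budget_ratio(slo_percent: float) -> float:
    """Fraction of events allowed to be bad, e.g. 99.9 -> roughly 0.001."""
    return (100.0 - slo_percent) / 100.0


def allowed_bad_requests(slo_percent: float, total_requests: int) -> float:
    """Number of bad requests the budget tolerates for a given request volume."""
    return total_requests * error_budget_ratio(slo_percent)


def allowed_bad_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Time-based view of the same budget over the SLO window."""
    return window_days * 24 * 60 * error_budget_ratio(slo_percent)


if __name__ == "__main__":
    # Hypothetical volume: 1,000,000 requests over a 30-day window.
    print(round(allowed_bad_requests(99.9, 1_000_000)))
    print(round(allowed_bad_minutes(99.9), 1))
```

With an SLO of 99.9%, this works out to roughly 1,000 bad requests per million, or roughly 43 minutes of full outage in 30 days.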
Burn rate
- This shows the rate at which the error budget is consumed.
- For example, a burn rate of 2 for a 30-day goal means "the pace at which you'll run out of error budget in 15 days."
- Monitoring the burn rate across multiple windows—short-term (5 minutes) and long-term (1 hour)—makes it easier to detect both sudden failures and chronic degradation.
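The relationship between burn rate and time to depletion can be sketched in a few lines of Python. This is only an illustration of the definition above; the function names are hypothetical, and a constant burn rate is assumed.

```python
def burn_rate(error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    return error_rate / allowed_error_rate


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full error budget is used up at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate


if __name__ == "__main__":
    # An error rate of 0.2% against a 99.9% SLO burns budget at about 2x.
    rate = burn_rate(0.002, 99.9)
    print(round(rate, 2), round(days_until_exhausted(rate), 1))
```

A burn rate of 1 means the budget lasts exactly the SLO window; anything above 1 exhausts it early.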
Find team members
You can proceed on your own, but try to find engineers who are positive about implementing SLOs.
If there are none, try asking someone directly, relying on the trust you gained in step 1.
When doing it with members
I prefer members who can compensate for my own shortcomings.
In my case, I have a narrow perspective and struggle with leadership, so I looked for someone who could fill those gaps.
It might be even better if the person is well-known within the company.
If you do it alone
I did my first SLO (Service Level Objective) project, DotMoney, all by myself.
Since I didn't yet have enough knowledge about SLOs to teach the team members, I proceeded quietly on my own, and together with the project manager, we decided on the implementation and how to act in the event of an error budget depletion.
I think it's also a valid approach to proceed independently and gradually introduce it to the engineering team.
Start small
From defining CUJs to implementing SLOs, there's no need to involve anyone other than the Product Manager and engineers at the start.
It's crucial for the server team to first grasp the concept of SLI/SLO.
Once the server team has gained a certain level of operational capability, it's effective to then leverage the benefits gained from that experience to expand the system to non-engineers.
Decide on CUJ
It's best to start with something easy.
For example, it's recommended to start with basic features like a login API.
You don't need to decide on a lot of CUJs all at once. Starting with around three is enough. You can gradually increase the number.
Decide on SLI
The best source for obtaining SLIs is the load balancer logs.
This allows you to obtain metrics at the location closest to the user.
If obtaining LB logs is difficult, you can use the metrics output by the backend.
When defining SLIs, it's important to ensure that automated requests such as those from curl, as well as malicious requests, are excluded.
Otherwise, the resulting metrics may not accurately reflect the actual user experience.
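As a rough sketch of this filtering, the Python below computes an availability SLI from load balancer log entries while skipping automated clients. The `LbLogEntry` fields and the user agent markers are hypothetical; real LB log formats differ, and detecting malicious requests needs more than a user agent check.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical markers for automated clients; tune these for your service.
AUTOMATED_AGENT_MARKERS = ("curl", "bot", "healthcheck")


@dataclass
class LbLogEntry:
    path: str
    status: int
    user_agent: str


def is_user_request(entry: LbLogEntry) -> bool:
    """True when the user agent does not look like an automated client."""
    agent = entry.user_agent.lower()
    return not any(marker in agent for marker in AUTOMATED_AGENT_MARKERS)


def availability_sli(entries: list[LbLogEntry]) -> Optional[float]:
    """Good / Total over user requests only; None when there is no traffic."""
    user_entries = [e for e in entries if is_user_request(e)]
    if not user_entries:
        return None
    good = sum(1 for e in user_entries if e.status < 500)
    return good / len(user_entries)
```

In practice the same exclusion would be written into the monitoring query itself, so that both Good and Total ignore these requests.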
Should latency and availability SLIs be separate or combined?
There are advantages and disadvantages to configuring latency and availability SLIs separately or together.
| | Separate | Combined |
|---|---|---|
| Advantages | Since they are separate, it is easier to identify and analyze the cause. | A single SLO makes it easier to grasp the overall service quality. |
| Disadvantages | The number of SLIs to manage increases. | Detailed problem analysis may become difficult. |
Personally, I recommend setting them up separately.
This makes it easier to identify the root cause of a problem and to implement more detailed improvement activities. Furthermore, since latency and availability have different properties, managing them separately makes it easier to develop appropriate countermeasures for each.
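The difference can be illustrated with a small Python sketch that computes the two SLIs separately and as one combined SLI. The 1600ms threshold, the `Request` type, and the choice to measure latency only over 2xx responses are assumptions for this example, not a prescribed definition.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY_THRESHOLD_MS = 1600  # hypothetical "Good" latency threshold


@dataclass
class Request:
    status: int
    latency_ms: float


def _ratio(good: int, total: int) -> Optional[float]:
    return good / total if total else None


def availability_sli(requests: list[Request]) -> Optional[float]:
    """Share of requests that did not fail with a 5xx status."""
    good = sum(1 for r in requests if r.status < 500)
    return _ratio(good, len(requests))


def latency_sli(requests: list[Request]) -> Optional[float]:
    """Share of successful (2xx) requests answered within the threshold."""
    successful = [r for r in requests if 200 <= r.status < 300]
    good = sum(1 for r in successful if r.latency_ms <= LATENCY_THRESHOLD_MS)
    return _ratio(good, len(successful))


def combined_sli(requests: list[Request]) -> Optional[float]:
    """Single SLI: a request is Good only if it is both available and fast."""
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= LATENCY_THRESHOLD_MS
    )
    return _ratio(good, len(requests))
```

When the combined value drops, it does not say whether errors or slowness caused it, whereas the separate values point at one or the other.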
Determine the SLO (Service Level Objective).
When determining the Target SLO, it's best to set a value that seems achievable with the current SLI. If the error budget remains at 100% for an extended period, you can gradually adjust it later.
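One way to pick such an achievable starting value is sketched below in Python: measure the current SLI and choose the strictest target from a short list that it already meets. The candidate list and function name are hypothetical, and this is only one possible heuristic.

```python
from typing import Optional, Sequence

# Hypothetical ladder of targets to choose from.
CANDIDATE_TARGETS = (99.0, 99.5, 99.9, 99.95, 99.99)


def initial_slo_target(
    good: int,
    total: int,
    candidates: Sequence[float] = CANDIDATE_TARGETS,
) -> Optional[float]:
    """Return the strictest candidate the measured SLI already meets."""
    if total == 0:
        return None  # no data yet, so no recommendation
    sli_percent = 100.0 * good / total
    met = [target for target in candidates if sli_percent >= target]
    return max(met) if met else None


if __name__ == "__main__":
    # Hypothetical measurement: 99,930 good out of 100,000 requests.
    print(initial_slo_target(99_930, 100_000))
```

If no candidate is met, that is a signal to look at the SLI definition or the service itself before setting a target.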
I have prepared a spreadsheet to organize SLI/SLO.
In the initial stages, we recommend using this to manage SLIs and SLOs within your team.
SLI Research and Organization Spreadsheet
Alert settings and error budget burn rate
Once you've implemented the SLO, you can start setting up alerts.
We recommend setting alerts based on the error budget burn rate.
Cloud Monitoring and Datadog handle burn rate differently.
Datadog
You'll need to calculate the burn rate yourself, but there's a table in the official Datadog documentation.
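If you do need to calculate it yourself, the usual arithmetic is simple enough to sketch in Python. The function names are hypothetical, and the resulting values should be checked against the table in the Datadog documentation rather than taken from this sketch.

```python
def burn_rate_threshold(
    budget_fraction: float,
    window_hours: float,
    slo_period_days: int = 30,
) -> float:
    """Burn rate at which `budget_fraction` of the budget is consumed
    within `window_hours`, for an SLO measured over `slo_period_days`."""
    period_hours = slo_period_days * 24
    return budget_fraction * period_hours / window_hours


def error_rate_threshold(rate: float, slo_percent: float) -> float:
    """Error rate that corresponds to a given burn rate for this SLO."""
    return rate * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # Consuming 2% of a 30-day budget within 1 hour.
    rate = burn_rate_threshold(0.02, 1)
    print(round(rate, 1), round(error_rate_threshold(rate, 99.9), 4))
```

For example, consuming 2% of a 30-day budget within 1 hour corresponds to a burn rate of about 14.4, which for a 99.9% SLO is an error rate of about 1.44%.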

Example queries in Grafana + Cloud Monitoring

Example queries in Datadog

Regarding fast and slow burn rates
There are two types of burn rates: fast and slow.
Each has the following characteristics:
- fast
- Monitor the burn rate with short lookback periods, such as 5m.
- Its primary purpose is to detect bugs during release, regressions caused by infrastructure changes, and regressions in external APIs.
- For example, a fast burn rate alert is well suited to catching cases where Graceful Shutdown isn't working properly during a release.
- slow
- Monitor the burn rate with a long lookback period, such as 1 hour.
- Its primary purpose is to detect gradual deteriorations in service quality that the fast burn rate misses (such as increased query latency due to a growing number of database records, or an N+1 query introduced by newly released feature code), rather than sudden degradations.
You can also combine the fast and slow burn rates into a single alert.
In the Datadog example, an alert combining 5m (fast) and 1h (slow) is used.
The system fires when both the 5m and 1h thresholds are reached, which helps suppress alert noise.
However, this eliminates the advantage of immediacy.
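The combined condition can be sketched in Python as follows. The threshold of 14.4 and the choice to treat a window with no traffic as a burn rate of 0 are assumptions for this illustration; the actual alert would be defined in Datadog or Grafana.

```python
BURN_RATE_THRESHOLD = 14.4  # hypothetical threshold shared by both windows


def window_burn_rate(bad: int, total: int, slo_percent: float) -> float:
    """Burn rate for one lookback window; 0.0 when there was no traffic."""
    if total == 0:
        return 0.0
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    return (bad / total) / allowed_error_rate


def combined_alert(fast_rate: float, slow_rate: float,
                   threshold: float = BURN_RATE_THRESHOLD) -> bool:
    """Fire only when both the 5m (fast) and 1h (slow) windows exceed it."""
    return fast_rate >= threshold and slow_rate >= threshold


if __name__ == "__main__":
    fast = window_burn_rate(bad=3, total=100, slo_percent=99.9)   # 5m window
    slow = window_burn_rate(bad=5, total=2000, slo_percent=99.9)  # 1h window
    # A short spike alone does not fire the combined alert.
    print(combined_alert(fast, slow))
```

Because the 1h window has to catch up before the alert fires, a genuine outage is reported later than it would be with the fast window alone.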
(Alert noise) How to deal with alerts that occur frequently during periods of low request volume.
During late night and early morning hours, the total number of requests (the denominator) is low, making the impact of bad requests greater and thus more likely to trigger alerts.
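The effect of a small denominator is easy to see with numbers. The Python sketch below compares the same two bad requests during a busy and a quiet window; the request counts are hypothetical.

```python
def burn_rate_from_counts(bad: int, total: int, slo_percent: float) -> float:
    """Burn rate computed from raw counts in one lookback window."""
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    return (bad / total) / allowed_error_rate


if __name__ == "__main__":
    # Two bad requests in both cases, against a 99.9% SLO.
    daytime = burn_rate_from_counts(bad=2, total=10_000, slo_percent=99.9)
    night = burn_rate_from_counts(bad=2, total=40, slo_percent=99.9)
    print(round(daytime, 2), round(night, 2))
```

With 10,000 requests the burn rate stays around 0.2, while with only 40 requests the same two failures push it to around 50, far above typical alert thresholds.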

One solution is the aforementioned fast + slow combined burn rate alert, but it eliminates the benefit of immediacy.
So, what should we do?
In this case, our policy is as follows: even if the error budget is depleted, there is no need to respond during the night, so we mute the burn rate alerts outside of business hours.
When the error budget is depleted, we tolerate a time lag and cover it with bi-weekly SLI/SLO review meetings.
Compromise is also important.
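The muting rule itself is simple, as the Python sketch below shows. The business hours (weekdays, 10:00 to 19:00 local time) and the function name are assumptions for this example; in practice the mute would be configured in the monitoring tool rather than in application code.

```python
from datetime import datetime, time

# Hypothetical business hours in local time.
BUSINESS_START = time(10, 0)
BUSINESS_END = time(19, 0)


def should_notify(fired_at: datetime) -> bool:
    """True only when the alert fires on a weekday within business hours."""
    is_weekday = fired_at.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = BUSINESS_START <= fired_at.time() < BUSINESS_END
    return is_weekday and in_hours


if __name__ == "__main__":
    print(should_notify(datetime(2024, 1, 10, 14, 0)))  # Wednesday afternoon
    print(should_notify(datetime(2024, 1, 10, 3, 0)))   # Wednesday, 3 a.m.
```

Alerts suppressed this way are not lost: the depleted error budget still shows up on the dashboard and in the next review meeting.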
4. Fostering an SLO culture
The steps to foster an SLO culture within the server team are as follows:
- Hold a team-wide study session on the basic concepts and importance of SLI/SLO.
- Establish regular SLO review meetings and have the entire team participate.
- Establish a response flow for SLO violations and share it within the team.
- Recognize members who contributed to improving SLOs and share the results with the entire team.
- Make it a habit to consider the impact on SLOs when introducing new features or changes.
Team study session
First, we'll hold study sessions and review meetings to ensure the information is disseminated throughout the team.
SLI/SLO study group materials for engineers
masked
Frequency of review meetings
Holding review meetings once a week might be too frequent and could lead to burnout.
It is important to set an appropriate frequency based on the number of SLOs and the team's available resources.
In this case, we conduct SLO review meetings every two weeks.
This results in a balanced frequency that avoids being overly concerned with short-term fluctuations while also allowing for an understanding of long-term trends.
Contents of the review meeting
We recommend covering the following topics in your review meeting:
- Checking the current status of each SLO: Checking the consumption of error budget and burn rate.
- Analysis of the causes and consideration of countermeasures in the event of an SLO violation.
- Proposal of new SLI/SLO and consideration of adjusting existing SLOs.
- Progress check of measures to improve SLOs
- Sharing knowledge and resolving questions regarding SLOs within the team.
Specifically, we will focus our discussions on projects that are running low on error budget or have high burn rates, while reviewing dashboards such as Grafana.
We will also review the progress made on the SLOs that were identified as issues in the previous review. Through these discussions, the entire team can understand the importance of SLOs and promote continuous improvement activities aimed at enhancing service quality.
Review meeting template
Here, we introduce a review meeting template using Wrike.

The important points are as follows:
- SLI/SLO name (or error budget name)
- category
- Latency or Availability
- Whether or not it can be turned into a task
- Task creation is unnecessary when there are unavoidable external factors.
- Error budget
- We will update the values at the time of the review meeting.
Category
While not mandatory, defining categories from the start makes things easier as the number of SLOs increases.
This service has 120 SLOs, so we split them among teams by category for review.
Adoption by people other than PdMs and engineers
The goal of promoting SLI/SLO to people outside of PdM and engineers is to establish SLI/SLO as a common language for the entire service and to utilize technical metrics in business decision-making.
This allows us to simultaneously improve service quality and achieve business objectives.
Furthermore, if SLI/SLO is adopted by people other than PdMs and engineers, the server team may become more motivated, because they will receive praise from people outside engineering.
Materials for persuading non-engineers
masked
Definition of adoption
We define adoption as the state in which people other than Product Managers and engineers voluntarily use (comment on) SLI/SLO and can discuss them with the server team.
Specifically, this means that people other than Product Managers and engineers can understand SLI/SLO figures and use them to propose service improvements and prioritize tasks.
Clearly define the goal
This makes visible "what conditions need to be met to bring people other than Product Managers and engineers on board."
In this particular case, we had a business team member who was on good terms with the server team, so our goal was to get the message across to this person and then spread it from there.

Recruitment phrases when expanding to people other than PdMs and engineers
- The implementation of SLI/SLO is expected to improve service quality and customer satisfaction.
- According to research by Amazon and Google, even a latency improvement of just 0.1 seconds can have a significant impact on sales.
- By utilizing SLI/SLO, not only is there information to help make decisions when dealing with failures, but in the long term, service quality improves, reducing the frequency of failures and alleviating the time constraints on people other than product managers and engineers.
- Let's work together to enhance the reliability of our services and drive business results.
I've attached the materials I actually used when explaining this to people other than Product Managers and engineers.
masked
5. Adjusting SLI/SLO
During the quality improvement process, a constant error budget of 100% is not a healthy state, so we adjust the SLI/SLO to bring the error budget to exactly 0.
This allows us to strike a balance between improving service quality and addressing technological challenges.
How to conduct a comprehensive SLO review
The set SLOs will be reviewed every 3 or 6 months.
The purpose of this is as follows:
- Correct cases where quality improvements have left excess margin in the error budget.
- Correct cases where the SLO was set too leniently, leaving excess error budget.
You can obtain a list of SLOs that need to be corrected by using a tool called vigil, which will be described later.
List the things that need to be corrected.
The `vigil` script can output SLOs that need correction to an Excel file.
The SLOs extracted by vigil are as follows:
- SLOs with excess error budget, meaning the error budget never fell below n% during the specified period.
- SLOs whose error budget was negative for 50% of the specified period.
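The two criteria can be expressed as simple checks over a time series of remaining error budget. The Python below is only an illustration of the criteria as described here, not vigil's actual implementation; the function names and the input format (a list of remaining-budget percentages sampled over the period) are assumptions.

```python
from typing import Sequence


def has_excess_budget(budget_percent: Sequence[float], n_percent: float) -> bool:
    """True if the remaining budget never fell below n% during the period."""
    if not budget_percent:
        return False
    return min(budget_percent) >= n_percent


def is_mostly_negative(budget_percent: Sequence[float], ratio: float = 0.5) -> bool:
    """True if the budget was negative for at least `ratio` of the samples."""
    if not budget_percent:
        return False
    negative = sum(1 for value in budget_percent if value < 0)
    return negative / len(budget_percent) >= ratio


def needs_correction(budget_percent: Sequence[float], n_percent: float) -> bool:
    """An SLO is listed when either criterion applies."""
    return has_excess_budget(budget_percent, n_percent) or is_mostly_negative(budget_percent)
```

The first check flags SLOs that are too lenient, and the second flags SLOs that are too strict for the service as it currently behaves.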
Usage example

Copy the generated Excel file into a spreadsheet or similar tool, and then discuss it as a team.
First things to check
Please check if the query is correct.
For example, check whether access by bots is excluded from each Good/Total, and whether unintended requests are being excluded.
Should I adjust SLO or SLI?
For latency-based SLOs, do not adjust the SLO target.
For availability, once you've confirmed the query is correct, adjust the SLO.
For latency, adjust the SLI instead.
SLI Tuning Guide

What you need when adjusting SLI
- Existing thresholds
- SLO
- SLI (Latency, status codes for availability, etc.)
- Graph of the relevant SLI
SLI adjustment examples by pattern
Latency deterioration due to external APIs

Overview
A temporary regression occurred in an external API that the service depends on. This affected the service's SLI and depleted its error budget, but it has since stabilized.
No action is required for this.
Unknown pattern

Overview
The SLI threshold set when the SLO was added is too strict, so the service consistently falls below the SLO.
This often happens when you add a new SLO.
In this case, you need to re-check the SLI.
1. Check if there are any errors in the SLI settings.
Let's check the current SLI configuration.

In this example, in addition to checking the max 1600 threshold, also verify that the conditions used for the SLI are correct.
For example, with a latency SLO, we expect a response with a 2xx status and a latency between 0 and 1600ms to count as "Good," but the status condition might be written incorrectly and include 5xx as well.
In the case of Latency SLO, the SLO is fixed at 99.9%, so it cannot be changed. Please adjust the latency side of SLI.
2. View the latency of the relevant endpoint.
Use a tool like Grafana to observe the latency of the endpoints defined in the SLI condition expression.

By default, Group by is set to mean, so change it to the 99th percentile.
While the SLI treats responses of up to 1600ms as Good, in reality the daytime p99 response time exceeds 1600ms on average.
Therefore, the threshold should be set above 1600ms.
Since it's difficult to find the optimal value from the start, you can change it to 1800ms, and if the error budget is constantly at 100%, you can tighten it to 1700ms.
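When you have raw latency samples at hand, candidate thresholds can be compared before changing the SLI. The Python sketch below is one way to do that; the candidate values and function names are hypothetical, and the 99.9% target follows the fixed latency SLO mentioned above.

```python
from typing import Optional, Sequence


def latency_sli_percent(latencies_ms: Sequence[float], threshold_ms: float) -> Optional[float]:
    """Percentage of requests answered within the threshold."""
    if not latencies_ms:
        return None
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return 100.0 * good / len(latencies_ms)


def smallest_passing_threshold(
    latencies_ms: Sequence[float],
    candidates_ms: Sequence[float],
    slo_percent: float = 99.9,
) -> Optional[float]:
    """Smallest candidate threshold at which the SLI meets the SLO."""
    for threshold in sorted(candidates_ms):
        sli = latency_sli_percent(latencies_ms, threshold)
        if sli is not None and sli >= slo_percent:
            return threshold
    return None  # none of the candidates is loose enough


if __name__ == "__main__":
    # Hypothetical samples: most requests are fast, a few take about 1650ms.
    samples = [500.0] * 995 + [1650.0] * 5
    print(smallest_passing_threshold(samples, [1600, 1700, 1800]))
```

Starting from the smallest passing value keeps the SLI as strict as the service can currently sustain.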
Make the change history easier to understand.
If the SLO changes significantly and the cause is unknown, it's possible that the SLI/SLO settings were adjusted.
To make it easier to follow along, the change history needs to be easy to understand.
By using GitHub Actions to automatically tag commit messages when SLI/SLO changes occur, you can make it easier to track change history.
This is one of the reasons why alerts and SLI/SLOs are implemented in Infrastructure as Code.
Actions yaml
Adjusting latency requires careful attention.
For latency, we recommend adjusting the SLI response time.
You could also adjust the Target SLO, but if both options are allowed, some people will adjust the SLI response time while others adjust the Target SLO.
To prevent variations in these tasks, it's a good idea to decide in advance to "adjust based on response time."
This ensures consistency within the team and makes SLI/SLO management more efficient.
Flow from start to finish

Tips
Prometheus: Meaning of percentile
The buckets, as seen in the code, are as follows:
25, 50, 100, 200, 400, 800, 1600, 3200, 6400
This is an example where 100 requests are distributed across the buckets.
This distribution can be represented graphically as follows:
To find p99, we first accumulate the number of requests for each bucket.
99% of 100 requests is 99 requests. The first bucket where the cumulative number of requests reaches 99 is bucket 7 (800ms - 1600ms). Therefore, p99 is 1600ms, the upper bound of that bucket.
This means that out of 100 requests, 99% were processed in 1600ms or less.
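The cumulative counting above can be written out as a short Python sketch. The bucket boundaries are the ones listed above, but the per-bucket request counts are hypothetical values chosen so that they sum to 100. Note that this returns the bucket's upper bound, as in the explanation above; Prometheus's `histogram_quantile` additionally interpolates linearly within the bucket.

```python
from typing import Optional, Sequence

BUCKET_UPPER_BOUNDS_MS = [25, 50, 100, 200, 400, 800, 1600, 3200, 6400]

# Hypothetical number of requests that landed in each bucket (100 in total).
EXAMPLE_COUNTS = [5, 10, 20, 25, 20, 12, 7, 1, 0]


def percentile_upper_bound(
    counts: Sequence[int],
    upper_bounds: Sequence[int],
    percent: int,
) -> Optional[int]:
    """Upper bound of the first bucket whose cumulative count reaches the rank."""
    total = sum(counts)
    if total == 0:
        return None
    rank = percent * total / 100  # e.g. 99% of 100 requests -> 99
    cumulative = 0
    for count, bound in zip(counts, upper_bounds):
        cumulative += count
        if cumulative >= rank:
            return bound
    return upper_bounds[-1]


if __name__ == "__main__":
    print(percentile_upper_bound(EXAMPLE_COUNTS, BUCKET_UPPER_BOUNDS_MS, 99))
```

With these counts the cumulative totals are 5, 15, 35, 60, 80, 92, 99, so the 99th request falls into the seventh bucket.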
In conclusion
I hope this will be of some help in implementing SLI/SLO.
