The Definitive Guide to SLI/SLO
This is Hasegawa @rarirureluis from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and more.
This article introduces, with concrete materials and templates, part of our internal documentation that explains the steps an SRE takes when joining a team to implement SLIs/SLOs there.
I hope this helps in some way.
Introduction
Are you managing to prevent the experience of the services you operate from gradually deteriorating?
Over time, an Internet service may see latency worsen as database records accumulate, or, as the service grows and request volume increases, suffer high load and a degraded user experience.
By using SLI/SLO, we can visualize these potential service degradations as quality metrics and respond accordingly.

In this article, I summarize the procedures and know-how I used to join another team as an Embedded SRE and introduce SLIs/SLOs there.
By reading this, you will get some tips on how to smoothly introduce SLI/SLO when you join another team as an Embedded SRE.
0. Mindset
It may take more than six months from the introduction of SLIs/SLOs to the point where the team is autonomously managing them, so patience is essential.
Along the way, you will encounter situations where you can enjoy the benefits of SLI/SLO. Once you experience them, you can use those benefits as motivation to keep using them. For example, you can detect potential degradation in service quality early and share a common language with team members other than engineers.
1. Gaining credibility
If you join another team and suddenly start talking about SLI/SLO, the other team may not see the value and may think that it will just take up a lot of time and effort.
Therefore, you need a phase in which the development team, PdMs, and non-engineers all come to feel that "this person is reliable" and "there is benefit in joining this person's efforts."
Toil reduction
- Toil refers to the repetitive, tedious, manual work involved in maintaining infrastructure and code.
- When you join the team, first understand the infrastructure configuration and deployment flow, and look for toil that can be reduced right away.
- Examples:
- Cleaning up unmaintained IaC
- Accelerating deployment flows
- Reviewing excessive resource allocations
Reducing toil is extremely effective at lightening the development team's load and creating spare capacity.
It becomes easier to gain the team's trust once they feel that "the amount of monotonous work has gone down since this person joined."
Cost reduction
- Cost reductions can be demonstrated as results that are easy to understand for non-engineers as well.
- Even relatively simple tasks like reviewing over-allocation of resources and shutting down unnecessary resources can have a big impact.
Regular Attendance
- It is important to communicate with your team by actively participating in daily and weekly regular meetings and evening gatherings.
- When you start to seriously implement SLI/SLO, build relationships with non-engineers and the engineering team early on so that discussions with them can go smoothly.
2. Organizing the alert infrastructure
It is advisable to get your alerts in order before introducing SLIs/SLOs.
If there are too many alerts or they are distributed across multiple tools, not only will the operational burden increase, but you will also be unable to take appropriate action even if an SLO violation is detected.
- Organize "Which alerts are the most important?" and "Are there any that are just noise?"
- Review the alert trigger criteria and notification groups to ensure operational costs are not too high
Datadog and Grafana Example
- Datadog
- The UI is comprehensive, making it easy to handle log-based custom metrics and latency/availability SLOs all at once.
- Grafana
- Supports a wide range of data sources including Prometheus and Cloud Monitoring
- Flexible configuration is possible, such as integrating Alertmanager and linking with Terraform for IaC.
Why IaC Alert Management?
- Change history management
- You can track changes to alert rules using Git, etc.
- Environmental Consistency
- The same definition can be used in both production and staging environments.
- High reproducibility
- Can be quickly adapted to new projects and teams
Managing alerts as code also makes them easier to review with your team.
3. Introducing SLIs/SLOs
Now that we have gained trust, organized the alert infrastructure, and gotten things running smoothly, it's time for the main event: introducing SLIs/SLOs.
Here's a quick introduction to key terminology and some tips to get you started.
Explanation of terms
CUJ (Critical User Journey)
- The definition of the most important sequence of steps a user takes on the service.
- For example, on an EC site, the sequence of steps would be "Top page → View product page → Add to cart → Complete payment."
- By first determining the CUJ, it becomes clear which indicators should be measured (SLIs).
SLI (Service Level Indicator)
- It is a quantitative indicator of the quality of a service.
- For example, API response times, error rates, successful request rates, etc.
- "Which metrics are needed to meet CUJ?HowIt is important for the team to agree on what to measure.
SLO (Service Level Objective)
- This is a "target value" that indicates how close the SLI should be achieved.
- For example, "99.9% of requests will respond within 3 seconds."
- This target value gives you an idea of how stable your service is or if there is room for improvement.
Error Budget
- This can be thought of as the "allowance for failure" that comes from setting the SLO target below 100%.
- Example: "SLO 99.9% → tolerate a total of 0.1% errors over 30 days" (a small worked example follows this list).
- It is important to create team rules, such as stopping additional development when the error budget is depleted and redirecting resources to restoring reliability.
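To make the example above concrete, here is a minimal sketch (my own illustration, with a hypothetical request volume) of how a 99.9% / 30-day target translates into an error budget:

```python
# Minimal sketch: turning the SLO example above into a concrete error budget.
# The 99.9% / 30-day figures mirror the example; the request volume is hypothetical.

slo_target = 0.999            # 99.9% SLO
window_days = 30
total_requests = 10_000_000   # hypothetical traffic over the 30-day window

budget_fraction = 1 - slo_target                        # 0.1% of requests may fail
allowed_bad_requests = total_requests * budget_fraction
allowed_downtime_min = window_days * 24 * 60 * budget_fraction

print(f"error budget: {budget_fraction:.2%} of requests")
print(f"= {allowed_bad_requests:,.0f} bad requests out of {total_requests:,}")
print(f"= roughly {allowed_downtime_min:.1f} minutes of full downtime in {window_days} days")
```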
Burn Rate
- Indicates how quickly the error budget is being consumed.
- For example, with a 30-day window, a burn rate of 2 means "a pace that exhausts the error budget in 15 days" (see the sketch after this list).
- Monitoring burn rate over multiple windows, both short-term (5 minutes) and long-term (1 hour), makes it easier to detect both sudden outages and chronic degradation.
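Here is the sketch referenced above: a minimal illustration with a hypothetical error rate showing how burn rate relates to time-to-exhaustion.

```python
# Minimal sketch: burn rate is how many times faster than the "budget-neutral"
# pace the error budget is being consumed.

slo_target = 0.999            # 99.9% SLO over a 30-day window
window_days = 30

budget_fraction = 1 - slo_target       # 0.1% error budget
observed_error_rate = 0.002            # hypothetical: 0.2% of requests currently failing

burn_rate = observed_error_rate / budget_fraction    # -> 2.0
days_to_exhaustion = window_days / burn_rate         # -> 15 days, as in the example above

print(f"burn rate = {burn_rate:.1f}")
print(f"at this pace the {window_days}-day budget is exhausted in {days_to_exhaustion:.0f} days")
```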
Find team members
You can go it alone, but try to find a team of engineers who are open to implementing SLOs.
If no one comes to mind, it might be a good idea to approach people based on the trust you built in step 1.
When working with members
Look for members who can cover the gaps in your own skills.
In my case, I tend toward a narrow perspective and am not strong at leadership, so I looked for someone who could make up for those gaps.
It might be even better if the person is well-known within the company.
If you're doing it alone
The first time I introduced SLOs, at dotmoney, I did it alone.
Since I did not yet understand SLOs well enough to teach them to other members, I quietly worked on it by myself, going as far as agreeing with the engineering and business leads on what to do when the error budget is depleted.
I think it's also possible to work on it alone and gradually spread it among your engineers.
Start small
From defining the CUJ through introducing SLOs, there is no need to involve anyone beyond the PdM and engineers at first.
It is important to first get a sense of SLI/SLO within the server team.
It would be effective to first get the server team up to a certain level of operation, and then use the benefits gained to expand to non-engineers.
Deciding on CUJ
It's best to start with something simple at first.
For example, we recommend starting with basic functionality like a login API.
You don't need to decide on a lot of CUJs right away. Three is enough at first. You can increase the number gradually.
Deciding on SLI
The best place to get the SLI is through the load balancer logs.
This allows you to get metrics closest to your users.
If it is difficult to obtain LB logs, you can use the metrics output by the backend.
When determining your SLI, it's important to be able to filter out automated requests such as curl or malicious requests.
Otherwise, you may end up with numbers that don't reflect the actual user experience.
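As one illustration of this kind of filtering, here is a rough sketch; the log fields and user-agent patterns are hypothetical and not tied to any particular load balancer's log format:

```python
import re

# Rough sketch only: hypothetical, simplified LB log records.
AUTOMATED_UA = re.compile(r"curl|bot|crawler|spider|monitor", re.IGNORECASE)

def is_user_request(record: dict) -> bool:
    """Keep only requests that plausibly reflect real user experience."""
    if AUTOMATED_UA.search(record.get("user_agent", "")):
        return False                          # curl, crawlers, synthetic monitors, etc.
    if record.get("path", "").startswith("/healthz"):
        return False                          # internal health checks
    return True

logs = [
    {"path": "/api/login", "status": 200, "user_agent": "Mozilla/5.0"},
    {"path": "/api/login", "status": 500, "user_agent": "curl/8.5.0"},
    {"path": "/healthz",   "status": 200, "user_agent": "kube-probe/1.29"},
]

user_logs = [r for r in logs if is_user_request(r)]
good = sum(1 for r in user_logs if r["status"] < 500)
print(f"availability SLI after filtering: {good}/{len(user_logs)} good requests")
```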
Separate or joint latency and availability SLIs?
There are pros and cons to setting latency and availability SLIs separately or together.
| | Separate | Combined |
|---|---|---|
| Merits | Causes are easier to identify and analyze because the signals are separate | A single SLO makes overall service quality easy to grasp |
| Disadvantages | More SLIs to manage | Detailed problem analysis may be difficult |
Personally, I recommend setting them separately.
This is because it is easier to identify the root cause of a problem and lead to more detailed improvement activities. Also, because latency and availability have different characteristics, managing them separately makes it easier to take appropriate measures for each.
Decide on your SLOs
When deciding on your Target SLO, it is a good idea to set it to a value that is achievable with your current SLI. If your error budget continues to be at 100%, you can gradually adjust it.
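For example, you can work backwards from what your current SLI has actually been delivering; here is a rough sketch with made-up numbers:

```python
# Rough sketch with hypothetical numbers: measure what the SLI already achieves,
# then pick the tightest "standard" target that is still below it.

good_requests = 9_986_000
total_requests = 10_000_000

current_sli = good_requests / total_requests        # 0.9986 -> 99.86%
candidate_targets = [0.999, 0.998, 0.995, 0.99]
starting_slo = max(t for t in candidate_targets if t <= current_sli)

print(f"current SLI over the window: {current_sli:.2%}")   # 99.86%
print(f"suggested starting SLO:      {starting_slo:.2%}")  # 99.80%
```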
I have a spreadsheet to help me organize my SLIs/SLOs.
We encourage you to use this in the early stages to manage your SLIs and SLOs internally.
SLI Research and Organization Spreadsheet
Setting alerts and error budget burn rates
Once you have implemented the SLO, you can set up alerts.
We recommend setting alerts on your error budget burn rate.
Cloud Monitoring and Datadog handle burn rate differently.
Datadog
You'll need to calculate the burn rate yourself, but Datadog's official documentation provides a table.
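The arithmetic behind such a table is straightforward. The sketch below uses the budget-consumption/window pairs commonly cited in the Google SRE Workbook (not values taken from this article): the threshold is simply the burn rate at which a given share of a 30-day budget would be consumed within the alert window.

```python
# Sketch of the arithmetic behind a burn-rate threshold table for a 30-day SLO window.

SLO_WINDOW_HOURS = 30 * 24   # 30-day SLO period

def burn_rate_threshold(budget_fraction_spent: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction_spent` of the budget disappears in `window_hours`."""
    return budget_fraction_spent * SLO_WINDOW_HOURS / window_hours

print(burn_rate_threshold(0.02, 1))    # 14.4 -> 2% of the budget gone in 1 hour (fast)
print(burn_rate_threshold(0.05, 6))    # 6.0  -> 5% of the budget gone in 6 hours
print(burn_rate_threshold(0.10, 72))   # 1.0  -> 10% of the budget gone in 3 days (slow)

# In a multi-window alert, the short window (e.g. 5m) usually reuses the long
# window's threshold and only acts as a "still happening right now" check.
```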

Grafana + Cloud Monitoring Query Example

Example queries in Datadog

Fast and Slow Burn Rates
There are two types of burn rates: fast and slow.
Each has the following characteristics.
- fast
- Monitor your burn rate over a short lookback period, such as 5m.
- The main purpose is to detect bugs during release, degradation during infrastructure changes, and degradation of external APIs.
- For example, a fast burn-rate alert is effective at catching cases where graceful shutdown is not working properly at release time.
- slow
- Monitor your burn rate over a long lookback period, such as 1h.
- The main purpose is to detect gradual degradation (such as query latency creeping up as DB records grow, or an N+1 query in newly released feature code) rather than the sudden quality drops that the fast burn rate catches.
Fast and slow combined burn rate alert
In the Datadog example, we use a combined alert of 5m (fast) and 1h (slow).
It will fire when both the 5m and 1h thresholds are reached, which helps to suppress alert noise.
But the benefit of immediacy is lost.
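Conceptually, the evaluation logic looks like the following sketch (pseudologic only, not how Datadog implements it internally): the alert fires only when both windows exceed the threshold, which filters out short blips while the long window tracks real budget consumption.

```python
# Pseudologic sketch of a multi-window burn-rate alert (not Datadog's implementation):
# the 1h window confirms meaningful budget consumption, the 5m window confirms
# the problem is still happening right now.

def should_alert(burn_rate_5m: float, burn_rate_1h: float, threshold: float) -> bool:
    return burn_rate_1h > threshold and burn_rate_5m > threshold

print(should_alert(burn_rate_5m=20.0, burn_rate_1h=16.0, threshold=14.4))  # True: fire
print(should_alert(burn_rate_5m=3.0,  burn_rate_1h=16.0, threshold=14.4))  # False: already recovering
```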
Alert noise: how to deal with frequent alerts during periods of low traffic
Late at night and early in the morning, the total number of requests (the denominator) is small, and the impact of bad requests is greater, so alerts are more likely to be triggered.

Against this problem, the fast + slow composite burn-rate alert mentioned above is effective, but it loses immediacy.
So what should we do?
In our case, we adopted the following policy: even if the error budget is depleted, there is no need to respond in the middle of the night, so burn-rate alerts are muted outside business hours.
We accept some lag in noticing that the error budget has been exhausted and pick it up in the biweekly SLI/SLO retrospectives.
Compromise is also important.
4. Cultivating an SLO culture
Here are some steps to foster an SLO culture within your server team:
- Conduct a team-wide training session on the basic concepts and importance of SLI/SLO
- Schedule regular SLO reviews and have the whole team participate
- Establish a procedure for responding to SLO violations and share it with your team.
- Celebrate contributions to SLO improvements and share their success with the whole team
- Make it a habit to consider the impact on SLOs when introducing new features or changes.
Team study sessions
First, we will hold study sessions and review sessions to ensure that the idea is absorbed within the team.
SLI/SLO study materials for engineers
masked
Frequency of retrospective meetings
Holding retrospective meetings once a week can be too frequent and can lead to exhaustion.
It's important to set the right frequency based on the volume of SLOs and your team's resources.
In this case, SLO review meetings are held once every two weeks.
This provides a well-balanced frequency that allows you to grasp long-term trends without being overly concerned with short-term fluctuations.
Contents of the retrospective meeting
We recommend covering the following during your retrospective:
- Check the current status of each SLO: Check the error budget consumption and burn rate
- Analyze the cause and consider countermeasures in case of SLO violation
- Consider proposing new SLIs/SLOs or adjusting existing SLOs
- Check the progress of measures to improve SLO
- Share knowledge and clarify questions about SLOs within your team
Specifically, we will look at dashboards such as Grafana and focus on discussing issues where error budgets are being depleted or where burn rates are high.
We also check the progress of improvements to SLOs that were problematic in the previous review. Through these discussions, the importance of SLOs can be understood by the entire team, and continuous improvement activities can be promoted to improve the quality of services.
Retrospective template
Here we introduce a retrospective template using Wrike.

The important points are as follows:
- SLI/SLO name (or error budget name)
- category
- Latency or Availability
- Can it be made into a task?
- There is no need to create a task when there are external factors that cannot be controlled.
- Error Budget
- We will update the values at the retrospective meeting.
Category
It's not required, but defining it from the beginning may make things easier when the number increases.
In the case of this service, there are 120 SLOs, and reviews are conducted by dividing into teams based on category.
Spreading to people other than PdMs and engineers
The goal of spreading SLI/SLO to people other than PdMs and engineers is to establish it as a common language across the service and to let technical metrics inform business decisions.
This allows you to simultaneously improve the quality of your services and achieve your business goals.
Furthermore, if the idea spreads beyond PdMs and engineers, the server team may also become more motivated by being recognized by non-engineers.
Materials for persuading non-engineers
masked
Definition of "spreading"
It means that people other than PdMs and engineers can bring up SLI/SLO on their own initiative and discuss it with the server team.
Specifically, this means that people other than PdMs and engineers will be able to understand the SLI/SLO numbers and make suggestions for improving and prioritizing services based on them.
Clarify your goal
Visualize what state you need to reach for people other than PdMs and engineers to take an interest and adopt the practice.
In this case, a member of the business team was on good terms with the server team, so we aimed to get the concept established with that person first and then let it spread from there.

Key phrases to use when spreading to people other than PdMs and engineers
- The introduction of SLI/SLO is expected to improve service quality and customer satisfaction.
- Studies by Amazon and Google have shown that improving latency by just 0.1 seconds can have a significant impact on sales.
- Utilizing SLI/SLO not only provides a basis for decision-making when responding to failures, but in the long term it also improves service quality, reducing the frequency of failures and reducing the time constraints of people other than PdMs and engineers.
- Let's work together to improve service reliability and drive business results
I have attached the materials I used when actually explaining the project to people other than PdMs and engineers.
masked
5. Adjusting SLIs/SLOs
As quality improvements take hold, it is not healthy for the error budget to simply sit at 100%, so we adjust our SLIs/SLOs so that the budget ends up being consumed down to roughly zero over the period.
This allows us to balance improving service quality with technological challenges.
How to do a full SLO retrospective
The set SLOs will be reviewed every 3 or 6 months.
The purpose of this is to:
- As quality improves, excess headroom builds up in the error budget, so we correct for that.
- The SLO that was set is too lenient, so the resulting gap in the error budget needs to be corrected.
You can get a list of SLOs that need to be corrected by using a tool called vigil, which will be described later.
Make a list of what needs to be corrected
The vigil script can output the SLOs that need to be corrected in an Excel file.
The SLOs extracted by vigil are:
- An overly generous SLO whose error budget has never dropped below n% during the specified period
- An SLO whose error budget has been negative for 50% of the specified period
Usage example

The output Excel file is then copied into a spreadsheet or similar and discussions are held.
Things to check first
Please check that your query is correct.
For example, check whether each Good/Total excludes access by bots and whether unintended requests are excluded.
Adjust the SLO or adjust the SLI?
For latency, do not adjust the SLO target itself.
Once you are satisfied that your queries are correct, availability SLOs can be adjusted directly.
For latency SLOs, adjust the SLI (the latency threshold) instead.
SLI Tuning Guide

What you need to adjust SLI
- Existing threshold
- SLO
- SLI (the latency threshold, or the status-code conditions in the case of availability, etc.)
- A graph of the corresponding SLI
Example of SLI adjustment by pattern
External APIs increase latency

Overview
A temporary degradation occurred in the API that the service relies on externally. This caused the service's SLI to worsen and the error budget to be depleted, but it has since stabilized.
No action is required.
Unknown Pattern

Overview
The SLI threshold set when the SLO was added is too strict, and the SLO is consistently being missed.
This often happens when you add a new SLO.
In this case you should recheck the SLI.
1. Check if the SLI is correct
Check your current SLI.

In this example, check not only the 1600 ms upper threshold but also that the conditions used to build the SLI are correct.
For example, for a latency SLI you expect a 2xx response completed within 0-1600 ms to count as good, but the status-code condition might be wrong and also include 5xx responses.
For latency, the SLO is fixed at 99.9% and is not changed, so adjust the latency side of the SLI instead.
2. Check the latency of the endpoint
Use Grafana or similar tools to observe latency from the endpoint defined in the SLI condition expression.

By default, the Group by function is set to mean, so we will change it to 99th percentile.
The SLI treats response times up to 1600 ms as good, but in reality the daytime p99 exceeds 1600 ms.
Therefore, the threshold should be raised above 1600 ms.
Since it is hard to find the optimal value from the start, you might change it to 1800 ms and, if the error budget then stays pinned at 100%, tighten it to 1700 ms.
Making change history clearer
If your SLOs move significantly and you don't know why, it's possible that you adjusted your SLI/SLOs.
The change history needs to be clear so that it is easy to follow.
You can use GitHub Actions to make it easier to track changes by automatically tagging commit messages when SLI/SLO changes are made.
This is one of the reasons why we bring alerts and SLIs/SLOs into IaC.
Actions yaml
Be careful with latency adjustments
For latency, we recommend adjusting the SLI response time.
You could also adjust the Target SLO instead, but then some people would adjust the SLI response time while others adjust the Target SLO.
It is a good idea to decide in advance that you will "adjust based on response time" to avoid this kind of variation in work.
This helps ensure consistency across teams and makes managing SLIs/SLOs more efficient.
Flow from start to finish

Tips
What do Prometheus percentiles mean?
Looking at the application code, the histogram buckets are defined as follows:
25, 50, 100, 200, 400, 800, 1600, 3200, 6400
Here is an example of 100 requests distributed across buckets.
This distribution can be depicted graphically as follows:
To calculate p99, first accumulate the number of requests in each bucket.
99% of 100 requests equates to 99 requests. The first bucket where the cumulative requests are greater than or equal to 99 is bucket 7 (800ms - 1600ms), so p99 is 1600ms.
This means that out of 100 requests, 99% of requests were processed in 1600ms or less.
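Here is that calculation as a small sketch. The per-bucket counts are made up so that the 99th request lands in the 800-1600 ms bucket, matching the walkthrough; note that Prometheus's histogram_quantile additionally interpolates linearly inside the bucket, whereas this simplified version just reports the bucket's upper bound.

```python
# Simplified p99 over histogram buckets, following the walkthrough above.
# The per-bucket request counts are hypothetical (they sum to 100, with the
# 99th request landing in the 800-1600ms bucket).

bucket_upper_bounds_ms = [25, 50, 100, 200, 400, 800, 1600, 3200, 6400]
requests_per_bucket    = [10, 15, 20, 20, 15, 10, 9, 1, 0]   # 100 requests total

def simple_percentile(q: float) -> int:
    target = q * sum(requests_per_bucket)        # 0.99 * 100 = 99 requests
    cumulative = 0
    for upper, count in zip(bucket_upper_bounds_ms, requests_per_bucket):
        cumulative += count
        if cumulative >= target:
            return upper                          # upper bound of the first bucket reaching the target
    return bucket_upper_bounds_ms[-1]

print(simple_percentile(0.99))   # 1600 -> "99% of requests finished within 1600ms"
```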
Conclusion
I hope this article will be of some help in introducing SLI/SLO.