Reaffirming the importance of SRE through the large-scale betting service "WINTICKET"

This is Hasegawa (@rarirureluis) from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-cutting support for the infrastructure of our media services, improves existing services, launches new ones, and contributes to OSS.
This article will introduce the benefits I gained from participating in WINTICKET as an Embedded SRE.
This article is the 6th day's entry in the CyberAgent Group SRE Advent Calendar 2024.
 

SRE x Sales


For those who are unfamiliar with SRE, I will first explain the benefits of doing SRE.
The adoption of SRE contributes directly and indirectly to increased sales. Specifically, it has a positive impact on sales in the following ways:

Increased Customer Lifetime Value (LTV)


SREs promote customer satisfaction by improving service stability, which can significantly improve LTV (customer lifetime value), especially for subscription-based services like SaaS.

Improved customer satisfaction through system reliability


Stable system operation helps you earn customer trust and increases sales over the long term. In particular, the following factors influence sales:
  • Reduced service downtime
  • Improved system response performance
  • Reduced error rate
  • Improved profitability through cost reduction
SRE is not just a way of operating systems; it is a strategic approach that directly and indirectly contributes to increasing corporate sales.
If you happen to come across this article in a business role, please support any engineers who are trying to introduce SRE.

Correlation between SRE and sales


I think many people want to correlate the effectiveness of SRE with quantitative figures (sales). I thought the same thing, but it was quite difficult and I gave up.
Specifically, there is a confounding correlation: "higher sales = higher server load = worsening SLO."
There are papers overseas that say SRE contributes to sales.
User-Engagement Score and SLIs/SLOs/SLAs Measurements Correlation of E-Business Projects Through Big Data Analysis
The Covid-19 crisis lockdown caused rapid transformation to remote working/learning modes and the need for e-commerce-, web-education-related projects development, and maintenance. However, an increase in internet traffic has a direct impact on infrastructure and software performance. We study the problem of accurate and quick web-project infrastructure issues/bottleneck/overload identification. The research aims to achieve and ensure the reliability and availability of a commerce/educational web project by providing system observability and Site Reliability Engineering (SRE) methods. In this research, we propose methods for technical condition assessment by applying the correlation of user-engagement score and Service Level Indicators (SLIs)/Service Level Objectives (SLOs)/Service Level Agreements (SLAs) measurements to identify user satisfaction types along with the infrastructure state. Our solution helps to improve content quality and, mainly, detect abnormal system behavior and poor infrastructure conditions. A straightforward interpretation of potential performance bottlenecks and vulnerabilities is achieved with the developed contingency table and correlation matrix for that purpose. We identify big data and system logs and metrics as the central sources that have performance issues during web-project usage. Throughout the analysis of an educational platform dataset, we found the main features of web-project content that have high user-engagement and provide value to services’ customers. According to our study, the usage and correlation of SLOs/SLAs with other critical metrics, such as user satisfaction or engagement improves early indication of potential system issues and avoids having users face them. These findings correspond to the concepts of SRE that focus on maintaining high service availability.
I think it would be quite difficult to do the same analysis as this paper.
So what motivates you to do SRE?
"We can proactively address the visibly deteriorating quality of service."is.
It's quite motivating when you can see the service quality visibly deteriorating.
 

SRE benefits gained from WINTICKET


First, I'd like to talk about the benefits of being an SRE at WINTICKET at this stage.

Ease of collaboration with businesses


SLI/SLO is very useful if you want to immediately share whether WINTICKET is affected when an external failure occurs.
Sharing SLI/SLO based on business impact makes it easier to collaborate

Potential deterioration in service quality can be visualized


An alert was triggered for the error budget burn rate, and when we looked at the metrics, we saw that the service quality was gradually deteriorating and the error budget was being consumed rapidly.
One of WINTICKET's strongest engineers, @taba2424, who is working with us on this SLO implementation
This is not limited to WINTICKET: this kind of latent deterioration in service quality would go unnoticed without SLI/SLO, and I think spotting it is one of the benefits of doing SRE.
 

WINTICKET Service Introduction


WINTICKET was released in 2019 as an internet betting service for publicly managed Keirin and Auto Races. The service's features include the ability to bet while watching race footage and an extensive database of WINTICKET's own original data, including AI predictions and EX data.
We also offer functions linked to ABEMA's Keirin and Auto Race channels. WINTICKET became the number one Keirin betting service about two years after its release, and is still growing today.
 

Self-introduction


I am @rarirureluis from the Media Headquarters Service Reliability Group (SRG).
The reason I introduced myself here is that I am not affiliated with WINTICKET.
In this Advent Calendar, as part of SRG, I would like to introduce my work as an Embedded SRE for another team, WINTICKET.
 

Embedded SRE


This is an activity (Enabling SRE) in which Site Reliability Engineers instill the culture and knowledge of SRE (Site Reliability Engineering) within development organizations so that developers themselves can practice SRE.
I'm on the SRG team, so I'm an Embedded SRE.
There is also the term "Enabling SRE," but I think it's roughly synonymous. (By the way, when I searched for "Enabling SRE" overseas, I didn't get any hits.)

Purpose


  • Spreading SRE culture and knowledge to product development teams (culture building)
  • Support developers to independently implement SRE practices (cultivating a culture)
  • Improving Service Reliability (SRE)
The ultimate goal of Enabling SRE is to develop members within each product team who can autonomously practice SRE.
 

To do SRE in another department


When I joined as an SRE, instead of just talking about SRE right away, I did various things to gain the trust of the service team.
Additionally, there is a possibility that the service team may become fatigued before it can reap the benefits of SRE, so the aim is to alleviate this somewhat by building up my own credibility first.
When I join a new team, my tasks are to set up monitoring and alerts, reduce Toil, and communicate.

SRE and Monitoring


The reason for setting up monitoring and alerting first has to do with SRE.
When the error budget runs out, there is no way to investigate without a proper monitoring environment.
That's why Monitoring is the foundation of the pyramid diagram you often see.

Personal monitoring and alerting tools


WINTICKET uses Google Managed Prometheus and Grafana, but I personally found Datadog easier to use.
Before joining WINTICKET, I used Datadog for a service called DotMoney. Its graphical UI and wide range of ways to define SLIs (log-based, SLIs that treat latency and availability equally) are features that Cloud Monitoring does not have.
 
Although I haven't used them in actual operations, I got the impression that SigNoz, which is open source software, and Grafana Cloud are easy to use if you are okay with small-scale deployments or SaaS.
💡
Although the article states that it cannot be used from an SRE perspective, a recent update has made it possible to use range vector selectors like those in PromQL, so I now think it can be used for SRE purposes.
 

Alert maintenance


Initially, the alert environment was in place, but there were several issues. The alert tools were split between Cloud Monitoring and Self-hosted Alertmanager, and the same alerts were defined in both, or unnecessary alerts were included. This meant that alert maintenance was insufficient, reducing the effectiveness of fault detection.
So we worked on the following two points.
  1. Unified alert system
    1. The alert tool was unified with Grafana, which was already being used as a visualization tool.
  2. Review alert definitions with your team
    1. List all alerts and align the necessity and importance of the alerts with team members
These efforts have resulted in a simpler and more effective alert structure and improved monitoring practices.

Unified alert system


Since we are already using Grafana, we decided to unify our alert system with Grafana as well.
Cloud Monitoring's alerting system is less flexible than Grafana's.
There were nearly 400 alerts to migrate, and we had to select which ones to keep and which to discard.
We deleted alert rules that could be covered by SLI/SLO, reduced the number of alerts that page the on-call, and so on.
Grafana's label-based alert rules (Notification Policy) are intuitive, and you can easily make them on-call by simply labeling only the alerts you want to handle.
In addition to the above, we gained four other benefits.
  • WINTICKET also uses AWS, making it easy to monitor using Grafana.
  • You no longer need to set up Alertmanager (GKE) with port forwarding.
  • Client team alerts can now be managed with Grafana
  • Added flexibility to alert notifications
    • Organizing and outputting information with Grafana Notification Templates

Grafana study session


The purpose of these study sessions is to give the team knowledge about Grafana and enable them to easily add alerts via Terraform.
That said, an ulterior motive is to increase my profile as an SRE within my team.
 

Toil reduction


Reducing Toil is a quick way to gain credibility.
For example:
  • Completely unmaintained IaC
  • A long-running deployment flow
Reducing toil allows you to understand the infrastructure configuration and deployment flow of a service, so if you find toil, you're lucky. Even if you don't find any, you can still understand the configuration, so it's worth trying.
 

Cost reduction


I think reducing costs is also a quick way to gain trust.
The advantage over Toil reduction is that it can be easier in some cases and more effective because the cost savings are good news for business teams.
 

communication


Participate in daily server team evening meetings, regular events, drinking parties, etc.
It's small, but it adds up.
 

WINTICKET system configuration


As an overview, this is the WINTICKET system configuration diagram. (In reality, it is multi-regional and quite large-scale.)
 

Monitoring architecture


The monitoring architecture has been unified to Grafana, so Alertmanager has disappeared and it has become simpler.

Before

After

 

Implementing SLI/SLO


When it comes time to actually implement SLI/SLO, we will work with engineers on the team who are interested in SLI/SLO.
You can work alone, but having someone who knows the team's situation well next to you can help things go a little more smoothly, so if possible, it's best to work with a team.
Not only will the process proceed more smoothly, but the members will also gain knowledge about SLI/SLO, which will be more effective when it comes to spreading the knowledge throughout the team.
This time, we proceeded together with @taba2424.

CUJ


WINTICKET was already operating SLI/SLO in its app, so we reflected that on the server side as well.
If you would like to know more about CUJ, please read this article.

Identifying SLIs/SLOs


Once the CUJ is decided, we will begin to identify the SLI.
We will summarize the current latency and error rate in a spreadsheet like the one below and organize them accordingly.
In situations like this, it goes more smoothly if you have someone on your team with you.
Made by @taba2424, one of WINTICKET's strongest engineers

Adding and Implementing SLOs


Once the identification is complete, we will actually implement the SLO.
We use Cloud Monitoring's SLO feature and operate the SLO dashboard with Grafana.
First, the Cloud Monitoring dashboard can only display 100 SLOs per service (more than 100 can actually be registered), so we visualize the SLOs using Grafana.
Server metrics are also visualized using Grafana, which makes it easy to avoid having to switch between tools.
 

Keeping SLIs high quality


When you operate a service, you may encounter access from crawlers or malicious users.
By rejecting such requests in advance, you can measure a higher-quality SLO.
WINTICKET uses Cloud Armor to take various measures to prevent malicious requests from reaching its microservices.
Specifically, we use rate limiting and Google Cloud Armor's preconfigured WAF rules to detect and quickly reject malicious requests. This keeps inappropriate requests out of the system, maintaining its health and further improving the reliability of the SLIs/SLOs.
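As a rough illustration of the idea (not WINTICKET's actual pipeline), here is a minimal Python sketch that excludes traffic rejected at the edge or identified as crawlers before computing an availability SLI. The log record and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    status_code: int
    blocked_by_waf: bool   # e.g. rejected by a Cloud Armor rule (hypothetical field)
    is_crawler: bool       # e.g. matched a known crawler user agent (hypothetical field)

def availability_sli(logs: list[RequestLog]) -> float:
    """Ratio of good (non-5xx) requests over traffic from real users only."""
    valid = [r for r in logs if not (r.blocked_by_waf or r.is_crawler)]
    if not valid:
        return 1.0
    good = sum(1 for r in valid if r.status_code < 500)
    return good / len(valid)

logs = [
    RequestLog(200, False, False),
    RequestLog(500, False, False),
    RequestLog(403, True, False),   # blocked by the WAF: excluded from the SLI
    RequestLog(200, False, True),   # crawler traffic: excluded from the SLI
]
print(f"availability SLI = {availability_sli(logs):.1%}")  # 50.0% over real user traffic
```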

Metrics used for SLI


The metrics used as SLIs at WINTICKET are as follows:
  • Prometheus metrics emitted by microservices
  • GCP metrics collected by Cloud Monitoring
 

Availability and Latency SLO


WINTICKET's SLI/SLO sets latency and availability as independent SLOs.
Personally, I think this is common in the world, but you can also combine latency and availability and measure them as a single SLO.
In fact, DotMoney, which I mentioned above and joined before WINTICKET, uses combined SLOs.
Cloud Monitoring's SLOs do not allow you to combine multiple metrics like Datadog does.

Which is better?


I think you should also take into account the characteristics of the service, but generally it's easier to keep them separate.
Combined: easier to manage.
Separate: availability and latency are distinguished, which makes investigation easier.
Keeping them separate increases the number of SLOs to manage, but because each SLO is narrower in scope, reviewing SLOs becomes much easier.
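To make the difference concrete, here is a minimal Python sketch (with an assumed 300 ms latency threshold and made-up requests) contrasting a combined SLI, where a request counts as good only if it is both successful and fast, with separate availability and latency SLIs.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int
    latency_ms: float

LATENCY_THRESHOLD_MS = 300.0  # illustrative threshold

def combined_sli(reqs: list[Request]) -> float:
    """Good = successful AND fast, measured as one SLI."""
    good = sum(1 for r in reqs if r.status_code < 500 and r.latency_ms <= LATENCY_THRESHOLD_MS)
    return good / len(reqs)

def availability_sli(reqs: list[Request]) -> float:
    return sum(1 for r in reqs if r.status_code < 500) / len(reqs)

def latency_sli(reqs: list[Request]) -> float:
    return sum(1 for r in reqs if r.latency_ms <= LATENCY_THRESHOLD_MS) / len(reqs)

reqs = [Request(200, 120), Request(200, 450), Request(503, 90), Request(200, 180)]
print(f"combined:     {combined_sli(reqs):.1%}")      # 50.0%
print(f"availability: {availability_sli(reqs):.1%}")  # 75.0%
print(f"latency:      {latency_sli(reqs):.1%}")       # 75.0%
```

With the separate SLIs it is immediately clear whether errors or slowness caused the degradation, at the cost of having twice as many SLOs to track.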
 

Target Window


The Target Window is determined by the frequency of the regular review meetings and the development cycle.
The deployment cycle for WINTICKET itself is approximately once a week, but the Target Window is set to 30 days, and SLO review meetings are held with the entire team once every two weeks.
Ideally, if your service deployments are every Wednesday, you could set up regular SLO meetings every Thursday with a one-week Target Window, allowing you to discuss changes to SLOs and error budgets due to feature releases.
However, with frequent reviews the team is unlikely to feel the effects of the SLOs, so we deliberately set a longer review period so that latent degradation can be identified and members feel that adopting SLOs was worthwhile.
Also, considering how things will behave when the error budget runs out, which I will explain in the next section, I think it might be quite difficult to set the Target Window to one week.
 

Error Budget


An error budget is the acceptable amount of lost reliability, derived from the SLO.
A diagram of the error budget, which fluctuates daily
Servers may temporarily violate their SLO due to high database load, etc.
The error budget expresses how much of that SLO violation you can tolerate.

Error budget operations


Not consuming the error budget may sound praiseworthy, but it can also be read as "are there fewer deployments?" or "is the team not taking on technical challenges?"
The error budget is also the budget for "technical challenges" allocated to the Target SLO.
If you consistently have error budget left over, tighten the SLO so that the budget is used up right at the end of the Target Window (ending at roughly 0%).

How to calculate the error budget


The error budget is calculated from the SLO target values.
For example, if your SLO is 99.9%, your error budget is 0.1%. If your Target Window is 30 days (43,200 minutes), your error budget equates to 43.2 minutes of downtime.
Specific calculation example: If the SLO is 99.9%
  • SLO: 0.999
  • Target Window: 30 days (43,200 minutes)
Error budget = (1 − 0.999) × 43,200 = 0.001 × 43,200 = 43.2 minutes
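The same calculation as a small Python sketch, so you can plug in other SLO targets and Target Windows:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Error budget, in minutes of full downtime, for a given SLO target and window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes

print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
print(round(error_budget_minutes(0.99, 30), 1))   # 432.0
```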
 

Error Budget Burn Rate


Burn rate is a term coined by Google that is a unitless value that indicates how quickly your error budget is consumed relative to the target length of your SLO. For example, if your target is 30 days, a burn rate of 1 means that at a constant rate of 1, your error budget will be completely consumed in exactly 30 days. A burn rate of 2 means that at a constant rate, your error budget will be depleted in 15 days, and a burn rate of 3 means that your error budget will be depleted in 10 days.
Datadog's documentation on burn rate is easy to understand.
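As a small sketch of the definition above (plain Python, not any particular monitoring tool's API): burn rate is the observed error rate divided by the error budget fraction (1 − SLO), and at a constant burn rate the budget lasts the Target Window divided by that rate.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget burns: observed error rate over the allowed error rate."""
    return observed_error_rate / (1 - slo)

def days_until_budget_exhausted(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, the budget lasts window / rate."""
    return window_days / rate

for observed in (0.001, 0.002, 0.003):  # error rates measured against a 99.9% SLO
    rate = burn_rate(observed, 0.999)
    print(f"burn rate {rate:.0f} -> budget exhausted in {days_until_budget_exhausted(rate):.0f} days")
# burn rate 1 -> budget exhausted in 30 days
# burn rate 2 -> budget exhausted in 15 days
# burn rate 3 -> budget exhausted in 10 days
```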
WINTICKET sets alerts on burn rate for one SLO with 5-minute and 1-hour time windows.
These will be referred to as fast burn rate and slow burn rate, respectively.

fast burn rate


The purpose is to detect cases where this alert fires right after a release, indicating that the changed code has degraded service quality.
Although it is not yet defined in the flow within WINTICKET, we would like to make it possible to use it as a criterion for deciding whether to roll back after a canary release.

slow burn rate


This is an important warning sign that indicates a chronic deterioration in the system's service quality, rather than a sudden failure like a fast burn rate.
It does not require immediate action, but it is an indicator that should be continually monitored.
 

Latency uses slow only; availability uses both, depending on its characteristics


For the burn rate alert for availability SLO, we set the alerts to fast and slow as mentioned above, but for latency we only set slow.
Availability Alerts
Latency Alerts
One reason we don't apply a fast burn rate to latency is that it can easily result in noisy alerts.
If the endpoint being measured depends on an external API, we do not apply a fast burn rate to the latency because it will be affected by the latency of that external API.
Using only slow gives good results here.
On the other hand, availability is set to take advantage of the characteristics of both fast and slow.

Burn rate alert at midnight


With a short time window, simple burn rate alerts often occur during times of low request volume, such as late at night.
For example, if your system receives 10 requests per hour, one failed request results in a 10% error rate for that hour. Against a 99.9% SLO, that is a 100x burn rate, and the single failure consumes 13.9% of your 30-day error budget, immediately triggering an alert.
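A quick sanity check of that arithmetic in Python:

```python
requests_per_hour = 10
slo = 0.999
window_hours = 30 * 24  # 30-day Target Window

hourly_error_rate = 1 / requests_per_hour                        # one failure -> 10%
burn_rate = hourly_error_rate / (1 - slo)                        # ~100x
allowed_failures = (1 - slo) * requests_per_hour * window_hours  # 7.2 failed requests per window
budget_consumed = 1 / allowed_failures                           # ~13.9% from a single failure

print(f"burn rate = {burn_rate:.0f}x, budget consumed by one failure = {budget_consumed:.1%}")
```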
Made by anies1212 of the WINTICKET App team
WINTICKET mitigates this somewhat by setting the burn rate threshold at 3 or 20.
Another approach, introduced on sre.google, is to artificially generate normal requests.
This is a way to mitigate the impact of error requests even during times when there are few requests.
It seems a little unrealistic, but I think it's interesting.
There are many other approaches introduced here.

If you want a more accurate burn rate


Combining multiple windows and multiple burn rates can eliminate false positives.
In this example, the alert will fire when the burn rate reaches 14.4 (14.4: 2% of the error budget consumed) over both the 5 minute and 1 hour intervals.
This will eliminate the benefit of immediacy, but it will allow you to configure essential alerts that indicate deterioration in service quality, excluding noisy alerts that will quickly recover.
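Conceptually, the combined condition looks like this minimal Python sketch; the burn-rate inputs and the 14.4 threshold mirror the example above, and how the per-window burn rates are obtained depends on your monitoring backend.

```python
def should_alert(burn_rate_5m: float, burn_rate_1h: float, threshold: float = 14.4) -> bool:
    """Fire only if the budget is burning fast over BOTH the short and long windows."""
    return burn_rate_5m >= threshold and burn_rate_1h >= threshold

# A spike that has already recovered: the 5-minute window is back to normal,
# so no alert fires even though the 1-hour window still looks bad.
print(should_alert(burn_rate_5m=1.0, burn_rate_1h=20.0))   # False
# A sustained problem: both windows are burning fast, so the alert fires.
print(should_alert(burn_rate_5m=30.0, burn_rate_1h=20.0))  # True
```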
 

When the error budget is depleted


Previously, when I participated in DotMoney as an Embedded SRE, I discussed this with the business manager, and we were able to agree that "when the error budget is exhausted, feature releases are prohibited, except for responses to production outages, fixes to restore reliability, and feature releases involving external companies."
At WINTICKET, when the error budget is depleted, the server team determines whether the cause is external; if it is not, they cut tasks and assign members to restore the error budget, creating a culture unique to the team.
 

Spreading SLI/SLO


Now we've reached the point where the SLIs/SLOs are actually written as code and visualized.
We create Grafana dashboards for each component.
 

What to visualize


What is the purpose of visualization?
The following items are required when reviewing SLOs:
  • Error Budget
  • Current SLO
SLO summary for each component
But this is not enough.
When the error budget is depleted, more information is needed to dig deeper:
  • Error budget time series graph
  • SLI time series graph
  • Latency time series graph for the relevant SLI (latency)
  • Response code time series graph for the relevant SLI (availability)
 
SLO details for each component
By deploying this in the same dashboard, you can enjoy the following benefits when reviewing:
  • You can determine when the condition worsened or improved.
    • If it overlaps with the release, that's the reason
    • External service failure
  • Can determine if the condition is continuing to worsen
  • If the SLO has deteriorated and countermeasures have been taken, you can judge whether it is improving.
  • If there are no other issues and the configured SLI/SLO is simply too strict, you can decide to adjust it.

Conducting study sessions


We hold study sessions for the whole team.
The purpose is to help people understand SLI/SLO at least a little, but there is no way that they can understand it in a one-time study session.
I didn't understand it either, so we shared things like, "This is what we're going to do!" and "These are the benefits!" and did it in a sort of rallying cry kind of atmosphere.
Made by @taba2424

Review SLOs with the whole team every two weeks


The purpose of reviewing SLOs with the entire team is to:
  • Creating a culture of SLI/SLO through team-wide efforts
  • (If service quality actually deteriorates) Help members understand the benefits of SLI/SLO and maintain their motivation

How to review on WINTICKET


At the time of writing, WINTICKET has 103 SLOs.
It would be quite exhausting to review all of this.
Therefore, the review looks at "only the SLOs where the error budget was depleted + what happened to the SLOs where the error budget was depleted at the time of the previous review."
This allows you to start with a simple operation that won't tire you out.
💡
Ultimately, we aim to reach a point where we agree with business roles on how to act when the error budget is depleted, and someone on the server team takes action to restore the error budget when it is depleted.
💡
What should we do about SLOs that don't consume any of the error budget? At WINTICKET, we identify cases where the error budget is hovering around 100% every three or six months, and tighten our SLIs and SLOs to prevent the error budget from becoming excessive.
 
WINTICKET uses Wrike for task management, so we also use Wrike for SLO retrospectives.
We try to keep the number of tools to a minimum and aim to operate in a way that doesn't tire out our members.
SLO Task Management in Wrike
The actual review process is as follows:
  1. Randomly select a facilitator
  2. Create a team for each category
  3. Each team reviews the SLOs for its assigned categories in Grafana
  4. File a task in Wrike for each SLO whose error budget was exhausted
  5. Decide whether or not a response is needed
    1. If the problem is with an external API, no action is required.
  6. Check the status of SLOs already filed in Wrike (error budget exhausted)
    1. Comment and update the error budget field
 

In fact, the quality of service is getting worse every day.


As mentioned earlier, the quality of service is getting worse every day.
For example, latency increases as the number of DB records increases.
Or perhaps a table added as part of a new initiative was missing an index.
Even if there is no problem at first, the impact will become obvious as the number of records increases.
Cases that get worse every day

Why has SLI/SLO become so widespread?


Looking at the current state of WINTICKET adoption, it appears that members other than those who promoted SLI/SLO have begun to look for the causes of deterioration using burn rate alerts and the SLO dashboard, and that the server team is independently implementing SLI/SLO.
In fact, I was the facilitator for the first review meeting, but now someone from the server team is taking over.
We've been able to get this far because we worked on it together, and it was reassuring that @taba2424 is so excellent.
In addition, I think it was also important that the WINTICKET development manager, @akihisasen, had been interested in SLI/SLO from the beginning, understood its benefits, and created a structure that made it easy to move forward.
 

Conclusion: The future and ideals of SLI/SLO


Currently, SLI/SLO has become a common language within the server team, but it has not yet achieved its original goal of becoming a common language with the business side.
The next step for the SRE team is to spread the knowledge to business roles. To achieve this, we plan to take the following approach:
  • Improved comprehensiveness
    • Adding a new SLO
    • Increased confidence in SLOs (e.g., SLI adjustments)
  • Involve someone from the business team in SLO review meetings with the server team
 
We have also created a business dashboard that allows business people to see the current SLOs in an easy-to-understand manner.
We will utilize these to gradually spread the word.
 

The need for adoption on the business side and a culture where those who improve SLOs are praised


I believe that anyone who improves service quality should be praised and recognized not only by engineers, but by everyone involved with the product.
However, WINTICKET currently has no agreement with the business side.
First, we are currently working on improving the reliability and comprehensiveness of SLOs while implementing and promoting SLOs across the entire server team.
 

Aside: Embedded SRE? Enabling SRE?


I've been hearing the term "Enabling SRE" a lot recently, so I looked into it, but I haven't seen it anywhere overseas.
X's community "SRE and observabilityWhen I asked a question, I received a lot of information.
There seems to be a description of "enabling teams" in the book "Team Topologies: Organizing Business and Technology Teams for Fast Flow."
 
SRG is looking for people to work with us. If you're interested, please contact us here.