A story about the importance of SRE in the large-scale betting service "WINTICKET"

This is Hasegawa (@rarirureluis).
I'm part of SRG (Service Reliability Group), which mainly provides cross-cutting support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
In this article, I will introduce the benefits I gained from participating in WINTICKET as an Embedded SRE.
This article is the 6th-day entry of the CyberAgent Group SRE Advent Calendar 2024.
 

SRE x Sales


For those who are unfamiliar with SRE, I will first explain the benefits of doing SRE.
Introducing SRE contributes directly and indirectly to increased sales. Specifically, it has a positive impact on sales in the following ways:

Increase Customer Lifetime Value (LTV)


SREs promote customer satisfaction by improving the stability of services, which can significantly improve LTV (customer lifetime value), especially in subscription-based services such as SaaS.

Improved customer satisfaction through system reliability


Stable system operation helps you earn customer trust and increases sales over the long term. In particular, the following factors affect sales:
  • Reduce service downtime
  • Improved system response performance
  • Reduce error rates
  • Improve profitability through cost reduction
SRE is not just a methodology for operating systems; it is a strategic approach that contributes directly and indirectly to increasing corporate sales.
If you happen to come across this article in a business role, please support the engineers who are trying to introduce SRE.

Correlation between SRE and sales


I think many people want to correlate the effectiveness of SRE with quantitative figures (sales). I thought the same, but it was quite difficult and I gave up.
Specifically, this is because of the confounding correlation "high sales = high server load = worsening SLOs."
There are papers overseas that suggest that SRE contributes to sales.
User-Engagement Score and SLIs/SLOs/SLAs Measurements Correlation of E-Business Projects Through Big Data Analysis
The Covid-19 crisis lockdown caused rapid transformation to remote working/learning modes and the need for e-commerce-, web-education-related projects development, and maintenance. However, an increase in internet traffic has a direct impact on infrastructure and software performance. We study the problem of accurate and quick web-project infrastructure issues/bottleneck/overload identification. The research aims to achieve and ensure the reliability and availability of a commerce/educational web project by providing system observability and Site Reliability Engineering (SRE) methods. In this research, we propose methods for technical condition assessment by applying the correlation of user-engagement score and Service Level Indicators (SLIs)/Service Level Objectives (SLOs)/Service Level Agreements (SLAs) measurements to identify user satisfaction types along with the infrastructure state. Our solution helps to improve content quality and, mainly, detect abnormal system behavior and poor infrastructure conditions. A straightforward interpretation of potential performance bottlenecks and vulnerabilities is achieved with the developed contingency table and correlation matrix for that purpose. We identify big data and system logs and metrics as the central sources that have performance issues during web-project usage. Throughout the analysis of an educational platform dataset, we found the main features of web-project content that have high user-engagement and provide value to services’ customers. According to our study, the usage and correlation of SLOs/SLAs with other critical metrics, such as user satisfaction or engagement improves early indication of potential system issues and avoids having users face them. These findings correspond to the concepts of SRE that focus on maintaining high service availability.
I think it would be quite difficult to perform the same analysis as this paper.
So what motivates you to do SRE?
It is that "we can deal with visibly deteriorating service quality in advance."
It's somehow motivating when you can actually see where service quality is deteriorating.
 

SRE Benefits Obtained by WINTICKET


First, I would like to talk about the benefits of being an SRE at WINTICKET at this stage.

Ease of business collaboration


SLI/SLO is very useful if you want to immediately share whether WINTICKET is affected when some external failure occurs.
Diagram showing how sharing SLI/SLO based on business impact can facilitate collaboration

Potential deterioration in service quality can be visualized


An alert was triggered for the error budget burn rate, and when we looked at the metrics, we saw that the service quality was gradually deteriorating and the error budget was being consumed rapidly.
@taba2424, one of WINTICKET's strongest engineers, who is working with me on the SLO rollout
This is not limited to WINTICKET; being able to see this kind of potential deterioration in service quality is something that would not be noticed without using SLI/SLO, and I think this is one of the benefits of being an SRE.
 

WINTICKET Service Introduction


WINTICKET was launched in 2019 as an internet betting service for publicly-run Keirin and Auto Race events. The service's features include the ability to bet while watching race footage, and an extensive database of WINTICKET original data, including AI predictions and EX data.
We also provide functions linked to ABEMA's Keirin and Auto Race channels. WINTICKET became the No. 1 Keirin betting service about two years after its release, and is still growing today.
 

Self-introduction


I'm @rarirureluis from the Media Headquarters Service Reliability Group (SRG).
The reason I am introducing myself here is because I am not affiliated with WINTICKET.
In this advent calendar, as part of the SRG, I would like to introduce you to my work as an Embedded SRE for another team, WINTICKET.
 

Embedded SRE


This is an activity in which Site Reliability Engineers instill SRE (Site Reliability Engineering) culture and knowledge within a development organization so that the developers themselves can practice SRE (also referred to as Enabling SRE).
Since I'm on the SRG team, I'm an Embedded SRE.
There is also the term "Enabling SRE," but I think it is roughly synonymous. (By the way, when I searched for the term "Enabling SRE" overseas, I got no hits.)

Purpose


  • Spreading SRE culture and knowledge to product development teams (culture building)
  • Support developers to implement SRE practices voluntarily (culture building)
  • Improving Service Reliability (SRE)
The ultimate goal of Enabling SRE is to develop members within each product team who can practice SRE autonomously.
 

Doing SRE in another department


When I joined as an SRE, rather than just starting to talk about SRE, I did various things to gain the trust of the service team.
Additionally, there is a risk that the service team becomes fatigued before it can enjoy the benefits of SRE, so the aim is to soften this by building up trust first.
When I join a new team, the first things I do are set up monitoring and alerting, reduce toil, and communicate.

SRE and Monitoring


The reason for setting up monitoring and alerting first has to do with SRE.
When your error budget runs out, there is no way to investigate without a proper monitoring environment.
That's why Monitoring is the foundation of the pyramid diagram you often see.

Personal monitoring and alerting tools


At WINTICKET we use Google Managed Prometheus and Grafana, but I personally find Datadog easier to use.
Before joining WINTICKET, I used Datadog for a service called DotMoney. Its graphical UI and wide range of ways to define SLIs (log-based SLIs, SLIs that combine latency and availability into one, etc.) are features not available in Cloud Monitoring.
 
Although I haven't used them in real operations, my impression is that the open-source SigNoz, as well as Grafana Cloud for small-scale setups or those who prefer SaaS, are easy to use.
💡
An earlier article stated that it could not be used from an SRE perspective, but a recent update added support for Range Vector Selectors like those found in PromQL, so I think it can now be used for SRE purposes.
 

Alert maintenance


Initially, an alerting environment was in place, but there were some issues. The alert tools were split between Cloud Monitoring and a self-hosted Alertmanager, the same alerts were defined in both, and unnecessary alerts were included. In this state, alert maintenance was insufficient and the effectiveness of failure detection was reduced.
So we worked on the following two points.
  1. Unify the alert system
    1. The alert tool was unified into Grafana, which was already being used as a visualization tool.
  2. Review alert definitions with the team
    1. List all alerts and align their necessity and importance with team members.
These efforts have resulted in a simpler, more effective alert structure and improved monitoring operations.

Unified alert system


Since we are already using Grafana, we decided to unify our alert system with Grafana as well.
Cloud Monitoring's alerting is less flexible than Grafana's.
There were nearly 400 alerts to migrate, and we had to carefully select which ones to carry over.
For example, we deleted alert rules that could be covered by SLIs/SLOs and reduced the number of alerts that page on-call.
Grafana's label-based alert rules (notification policies) are intuitive, and you can easily make this happen by labeling only the alerts you want to handle on-call.
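To make the routing idea concrete, here is a minimal Python sketch of label-based routing in the spirit of Grafana notification policies. This is not WINTICKET's actual configuration; the label values and receiver names are hypothetical.

```python
# A toy sketch of label-based alert routing, similar in spirit to Grafana
# notification policies. Label names, values, and receivers are hypothetical.
from typing import Dict, List


def route_alert(labels: Dict[str, str]) -> List[str]:
    """Return the notification channels an alert should go to, based on its labels."""
    receivers = ["slack-alerts"]  # every alert at least lands in a Slack channel
    if labels.get("severity") == "oncall":
        receivers.append("pagerduty-oncall")  # only explicitly labeled alerts page on-call
    return receivers


# An SLO burn-rate alert that should page, and a noisy infra alert that should not.
print(route_alert({"alertname": "SLOFastBurnRate", "severity": "oncall"}))
print(route_alert({"alertname": "NodeDiskUsageHigh"}))
```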
In addition to the above, we gained four other benefits.
  • WINTICKET also uses AWS, making it easy to monitor using Grafana.
  • No need to set up Alertmanager (GKE) with Port Forwarding anymore
  • Grafana can now be used to manage alerts for client teams.
  • Added flexibility to alert notifications
    • Organizing and outputting information with the Grafana Notification Template

Grafana study session


The purpose of this study session is to give the team knowledge about Grafana and to make it easy for them to add alerts via Terraform.
That said, one ulterior motive is to raise my profile as an SRE within the team.
 

Toil reduction


Reducing toil is a quick way to gain credibility.
For example:
  • Completely unmaintained IaC
  • A long-running deployment flow
Working on toil reduction requires you to understand the service's infrastructure configuration and deployment flow. Finding toil is a bonus, but even if you don't find any, you still end up understanding the configuration, so it's worth doing.
 

Cost reduction


I think reducing costs is also a quick way to gain trust.
The advantage over toil reduction is that it can sometimes be easier and more effective, because cost savings are good news for business teams.
 

Communication


I participate in the server team's daily evening meetings, regular events, drinking parties, and so on.
It's small, but it adds up.
 

WINTICKET system configuration


This is an overview of the WINTICKET system configuration. (In reality, it is multi-regional and quite large-scale.)
 

Monitoring architecture


The monitoring architecture has been unified into Grafana, so Alertmanager has disappeared and it has become simpler.

Before

After

 

Implementing SLI/SLO


When it comes time to actually implement SLI/SLO, we will work with engineers on our team who are interested in SLI/SLO.
You can work alone, but having someone who knows the team's situation well next to you can make the process go more smoothly, so it's best to work with a team if possible.
Not only will it make things go more smoothly, but it will also help each team member gain knowledge about SLI/SLO, making it more effective when it comes to spreading the knowledge throughout the team.
This time, we proceeded together with @taba2424.

CUJ


WINTICKET was already operating SLI/SLO on its app, so we reflected that on the server side as well.
If you would like to know more about CUJs (Critical User Journeys), please read this article.

Identifying SLIs/SLOs


Once the CUJs have been decided, we start looking into the SLIs.
We will summarize the current latency and error rates in a spreadsheet like the one below and organize them separately.
In situations like this, having someone on your team makes things go more smoothly.
Made by @taba2424, one of WINTICKET's strongest engineers

Adding and Implementing SLOs


Once the identification is complete, we can actually implement the SLOs.
We use the SLO feature of Cloud Monitoring and use Grafana to run the SLO dashboard.
To start with the reason: the Cloud Monitoring dashboard can only display 100 SLOs per service (although more than 100 can actually be registered), so we visualize the SLOs with Grafana.
Server metrics are also visualized using Grafana, which makes it easy to use without having to switch between tools.
 

Keeping SLIs high quality


When operating a service, crawlers and malicious users may access it.
By rejecting such requests up front, you can measure high-quality SLOs.
At WINTICKET, we use Cloud Armor to take various measures to prevent malicious requests from reaching our microservices.
Specifically, we use rate limiting and pre-configured Google Cloud Armor WAF rules to detect and quickly reject malicious requests. This helps us maintain the health of our systems by blocking inappropriate requests, while further improving the reliability of our SLIs/SLOs.
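As a toy illustration of why this matters for SLI quality (this is not WINTICKET's implementation, and the request fields are hypothetical), excluding traffic that was rejected at the edge keeps crawler and attack noise out of the measurement:

```python
# Toy example: compute an availability SLI only over requests that actually
# reached the service, excluding traffic rejected at the edge (WAF / rate limiting).
from dataclasses import dataclass


@dataclass
class Request:
    status: int            # HTTP status code returned
    blocked_at_edge: bool  # hypothetical flag: rejected by Cloud Armor, WAF, etc.


def availability_sli(requests: list[Request]) -> float:
    """Good events / total events, counting only requests that reached the service."""
    valid = [r for r in requests if not r.blocked_at_edge]
    if not valid:
        return 1.0
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


requests = [
    Request(200, False),
    Request(200, False),
    Request(429, True),   # crawler hitting the rate limit at the edge
    Request(403, True),   # malicious request rejected by a WAF rule
    Request(500, False),  # a real server error that should count against the SLO
]
print(f"availability SLI = {availability_sli(requests):.2%}")  # 66.67% over 3 valid requests
```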

Metrics used for SLI


The metrics used as SLIs at WINTICKET are as follows:
  • Prometheus metrics emitted by microservices
  • GCP metrics collected by Cloud Monitoring
 

Availability and Latency SLOs


WINTICKET's SLI/SLO sets latency and availability as independent SLOs.
Personally, I believe this is common practice, but you can also combine latency and availability and measure it as a single SLO.
In fact, DotMoney, which I mentioned joining before WINTICKET, uses such combined SLOs.
Cloud Monitoring's SLOs do not allow you to combine multiple metrics like Datadog does.

Which is better?


I think you should also take into account the characteristics of the service, but generally it's easier to keep them separate.
Together: Easy to manage
Separate: Availability and latency are separate, making it easier to investigate.
Keeping them separate increases the number of SLOs to manage, but each one is simpler, so reviewing SLOs is considerably easier.
 

Target Window


The Target Window is determined by the cadence of your regular review meetings and your development cycle.
The deployment cycle for WINTICKET itself is approximately once a week, but the Target Window is set to 30 days, and SLO review meetings are held with the entire team once every two weeks.
Ideally, if your service is deployed every Wednesday, you could set up a regular SLO schedule every Thursday with a one-week Target Window, allowing you to discuss changes to SLOs and error budgets due to feature releases.
However, with overly frequent reviews the team is less likely to feel the effects of the SLOs, so we deliberately chose a longer span so that latent degradation becomes visible and members feel glad they introduced SLOs.
Also, considering how things will behave when the error budget runs out, which I will explain in the next section, I think setting the Target Window to one week may be quite strict.
 

Error Budget


An error budget is the acceptable amount of unreliability derived from the SLO.
A diagram of the error budget, which increases and decreases daily
Servers may temporarily violate their SLO due to high database load, etc.
The error budget expresses how much SLO violation you can tolerate.

Working with error budgets


Not spending the error budget may sound praiseworthy, but you can also flip the perspective: "are there too few deployments?" or "is the team not taking on technical challenges?"
The error budget is also the budget for "technical challenges" allocated against the target SLO.
If you consistently have error budget left over, tighten the SLO so that the budget lands at roughly 0% at the end of the target window.

How to calculate the error budget


The error budget is calculated from the SLO targets.
For example, if your SLO is 99.9%, your error budget is 0.1%. If your target window is 30 days (43,200 minutes), your error budget equates to 43.2 minutes of downtime.
Specific calculation example:If the SLO is 99.9%
  • SLO: 0.999
  • Target Window: 30 days (43,200 minutes)
(1 − 0.999) × 43,200 = 0.001 × 43,200 = 43.2 minutes
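As a sanity check, the same calculation takes only a few lines of Python:

```python
# Error budget expressed as minutes of allowed downtime for a given SLO target.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


print(error_budget_minutes(0.999))  # 43.2  minutes for 99.9% over 30 days
print(error_budget_minutes(0.99))   # 432.0 minutes for 99%   over 30 days
```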
 

Error Budget Burn Rate


Burn rate is a term coined by Google; it is a unitless value that indicates how quickly the error budget is consumed relative to the SLO's target window. For example, with a 30-day window, a constant burn rate of 1 consumes the error budget in exactly 30 days, a constant burn rate of 2 consumes it in 15 days, and a constant burn rate of 3 consumes it in 10 days.
Datadog's documentation on burn rate is very clear.
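In code, the relationship between burn rate, observed error rate, and time to exhaustion looks roughly like this (a sketch of the general definition, not WINTICKET's alert rules):

```python
# Burn rate: how fast the error budget is consumed relative to the SLO target.
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    budget_rate = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate


print(burn_rate(0.001, 0.999))                       # 1.0 -> budget lasts exactly 30 days
print(days_to_exhaustion(burn_rate(0.002, 0.999)))   # 15.0 days at a constant burn rate of 2
print(days_to_exhaustion(burn_rate(0.003, 0.999)))   # 10.0 days at a constant burn rate of 3
```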
At WINTICKET, we set burn rate alerts with 5-minute and 1-hour time windows for each SLO.
From now on, these will be referred to as fast burn rate and slow burn rate, respectively.

Fast burn rate


The purpose is to detect if a bug occurs after a release and the quality of the service is degraded due to the changed code.
Although it is not yet defined in the flow within WINTICKET, we would like to make it possible to use it as a criterion for deciding whether to revert after a canary release.

Slow burn rate


This is not a sudden failure like a fast burn rate, but an important warning sign that indicates a chronic deterioration in the system's service quality.
While no immediate action is required, it is an indicator that you should continue to be aware of.
 

Latency: slow only; availability: both, depending on its characteristics


For the burn rate alerts on availability SLOs, we set the alerts to fast and slow as mentioned above, but for latency we only set it to slow.
Availability Alerts
Latency Alerts
One of the reasons we don't apply a fast burn rate to latency is the issue of noisy alerts.
If the endpoint being measured relies on an external API, we do not apply a fast burn rate to the latency because it will be affected by the latency of that external API.
For latency, the slow burn rate alone gives good results.
On the other hand, availability is set up to take advantage of both fast and slow characteristics.

Burn rate alert at midnight


With a short time window, simple burn rate alerts often fire at times when there are fewer requests, such as late at night.
For example, if your system receives 10 requests per hour, one failed request would result in a 10% error rate per hour. With a 99.9% SLO, this request would result in 1,000x burn rate and consume 13.9% of your 30-day error budget, thus firing an alert immediately.
Made by anies1212 of the WINTICKET app team
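The arithmetic above can be reproduced in a few lines of Python; it also shows how strongly the burn rate depends on the alert window when traffic is this low (a toy calculation, not WINTICKET's actual numbers):

```python
# Low-traffic example: 10 requests/hour, one of them fails, 99.9% SLO, 30-day window.
SLO = 0.999
BUDGET_RATE = 1 - SLO  # 0.001

requests_per_hour = 10
total_requests_30d = requests_per_hour * 24 * 30     # 7,200 requests in the window
allowed_failures = total_requests_30d * BUDGET_RATE  # 7.2 failed requests allowed

print(f"budget consumed by one failure: {1 / allowed_failures:.1%}")  # 13.9%

# Over a 1-hour window: 1 failure out of 10 requests = 10% error rate.
print(f"1h burn rate: {0.10 / BUDGET_RATE:.0f}x")   # 100x

# Over a 5-minute window that happens to contain only that one (failed) request,
# the error rate is 100%, so the burn rate explodes.
print(f"5m burn rate: {1.0 / BUDGET_RATE:.0f}x")    # 1000x
```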
WINTICKET mitigates this somewhat by using burn rate thresholds of 3 or 20.
Another approach, introduced on sre.google, is to artificially generate normal requests.
This is a way to reduce the impact of error requests, even during times of low request volume.
It seems a little unrealistic, but I think it's interesting.
There are many other approaches presented here.

If you want a more accurate burn rate


Combining multiple windows and multiple burn rates can help eliminate false positives.
In this example, the alert fires when the burn rate reaches 14.4 over both the 5-minute and the 1-hour window (a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day error budget).
This sacrifices some immediacy, but it lets you configure alerts that only fire on genuine deterioration in service quality, eliminating noisy alerts for issues that quickly recover.
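Sketched in Python, the combined condition looks like this; the 14.4 threshold and the 5-minute/1-hour windows follow the example above, and everything else is hypothetical:

```python
# Multi-window, multi-burn-rate alerting: only fire when both the short and the
# long window agree that the budget is burning fast (here, 2% of a 30-day budget per hour).
SLO = 0.999
FAST_BURN_THRESHOLD = 14.4


def burn_rate(error_rate: float) -> float:
    return error_rate / (1 - SLO)


def should_fire(error_rate_5m: float, error_rate_1h: float) -> bool:
    return (burn_rate(error_rate_5m) >= FAST_BURN_THRESHOLD
            and burn_rate(error_rate_1h) >= FAST_BURN_THRESHOLD)


# A short spike that already recovered: the 5m window looks bad, the 1h window does not.
print(should_fire(error_rate_5m=0.05, error_rate_1h=0.002))  # False -> no noisy page
# A sustained problem: both windows exceed the threshold, so the alert fires.
print(should_fire(error_rate_5m=0.05, error_rate_1h=0.02))   # True
```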
 

When the error budget is exhausted


When I previously participated in DotMoney as an Embedded SRE, I discussed this with the business manager and we agreed that "when the error budget is exhausted, feature releases are prohibited, except for production incident fixes, fixes that restore reliability, and releases involving external companies."
At WINTICKET, when the error budget runs out, the server team has formed its own culture: we determine whether the cause is external, and if not, we carve out tasks and assign members to restore the error budget.
 

The spread of SLI/SLO


At this point, we can actually implement the SLIs/SLOs in code and visualize them.
We create Grafana dashboards for each component.
 

What to visualize


What is the purpose of visualization?
The following items are required when reviewing SLOs:
  • Error Budget
  • Current SLO
SLO summary for each component
But this alone is not enough: when the error budget is depleted, you need more information to dig into the cause.
  • Error budget time series graph
  • SLI time series graph
  • Time series latency graph for the relevant SLI (latency)
  • Time series of response codes for the relevant SLI (availability)
 
SLO details for each component
By deploying this in the same dashboard, you can enjoy the following benefits when reviewing:
  • You can determine when your condition worsened or improved.
    • If it overlaps with the release, that's the reason
    • External service failure
  • Can determine whether the condition is continuing to deteriorate
  • If the SLO has deteriorated, you can take measures to determine whether it is improving.
  • If there are no other issues and the SLI/SLO you set is simply too strict, you can decide to adjust it.

Conducting study sessions


We hold study sessions with the whole team.
The purpose is to help people understand SLI/SLO at least a little, but it is impossible to understand it in a one-time study session.
I didn't understand it either, so we shared things like, "This is what we're going to do!" and "These are the benefits!" and we had a sort of rallying call type of atmosphere.
Made by @taba2424

Review SLOs with the whole team every two weeks


The purpose of reviewing SLOs with the entire team is to:
  • Creating a culture of SLI/SLO through team-wide efforts
  • (If service quality actually deteriorates) Help users realize the benefits of SLI/SLO and keep them motivated

How to do reviews on WINTICKET


At the time of writing this article, WINTICKET has 103 SLOs.
It would be exhausting to review all of this.
Therefore, the review looks at "only the SLOs where the error budget has been depleted + what happened to the SLOs where the error budget was depleted at the time of the previous review."
This allows you to start out with a simple operation that won't tire you out.
💡
Ultimately, we aim to reach a point where we agree with the business role on how to act when the error budget is depleted, and when the error budget is depleted, someone on the server team takes action to refill the error budget.
💡
What about SLOs that don't consume any error budget at all? At WINTICKET, we identify cases where the error budget is fluctuating around 100% once every three or six months, and tighten the SLIs and SLOs to prevent the error budget from becoming excessive.
 
We use Wrike for task management at WINTICKET, so we also use it for SLO retrospectives.
We try to keep the number of tools to a minimum and aim to operate in a way that doesn't tire our team members out.
SLO Task Management in Wrike
The actual review process is as follows:
  1. Randomly select a facilitator
  2. Create a team for each category
  3. Each team looks at the SLOs for its assigned categories in Grafana
  4. File tickets in Wrike for SLOs whose error budget is exhausted
  5. Decide whether a response is needed
    1. If the problem lies with an external API, no action is required.
  6. Check the status of SLOs already filed in Wrike (error budget exhausted)
    1. Comment and update the error budget column
 

The quality of service is actually getting worse day by day.


As mentioned earlier, the quality of service is getting worse day by day.
For example, latency increases with the number of DB records.
Or a table added for a new feature turns out to have an index problem.
Even if there is no problem at first, the impact will become obvious as the number of records increases.
An example of quality deteriorating day by day

Why has SLI/SLO become so widespread?


Looking at the current state of WINTICKET adoption, it appears that people other than the members who promoted SLI/SLO have started to look for the causes of the deterioration using burn rate alerts and the SLO dashboard, and that the server team is independently implementing SLI/SLO.
In fact, I was the facilitator for the first review meeting, but now someone from the server team is taking over that role.
We've come this far largely because @taba2424, who has been working on this with me, is such an excellent engineer; that was reassuring.
In addition, a big factor was that @akihisasen, the head of development at WINTICKET, had been interested in SLIs/SLOs from the beginning, understood their benefits, and created a structure that made it easy to move forward.
 

Conclusion: The future and ideals of SLI/SLO


Currently, SLI/SLO has become a common language within the server team, but it has not yet achieved the original goal of becoming a "common language with the business."
The next step for the SRE team is to incorporate SRE into business roles. To achieve this, we plan to take the following approaches:
  • Improved comprehensiveness
    • Adding a new SLO
    • Increase confidence in your SLOs (e.g., adjust your SLIs)
  • Involve someone from the business team in SLO review meetings with the server team
 
We have also created a business dashboard that shows the current SLOs so that business people can easily understand them.
We will utilize these to gradually increase penetration.
 

The need for business penetration and a culture where those who improve SLOs are praised


I believe that anyone who improves service quality should be praised and recognized not just by engineers, but by everyone involved with the product.
However, WINTICKET currently has no agreement with the business side.
First of all, we are currently working on improving the reliability and comprehensiveness of SLOs while operating and promoting SLOs across the entire server team.
 

Aside: Embedded SRE? Enabling SRE?


I've been hearing the term "Enabling SRE" a lot recently, so I looked into it, but I've never seen it overseas.
When I asked in the X community "SRE and Observability," I received a lot of information.
The concept of an "enabling" team appears in the book "Team Topologies: Organizing Business and Technology Teams for Fast Flow."
 
SRG is looking for people to work with us. If you are interested, please contact us here.