A story about the importance of SRE in the large-scale betting service "WINTICKET"
SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and more.
In this article, I will introduce the benefits I gained from participating in WINTICKET as an Embedded SRE.
SRE x Sales
For those who are unfamiliar with SRE, I will first explain the benefits of doing SRE.
Introducing SRE contributes directly and indirectly to increased sales. Specifically, it has a positive impact on sales in the following ways:
Increase Customer Lifetime Value (LTV)
SREs promote customer satisfaction by improving the stability of services, which can significantly improve LTV (customer lifetime value), especially in subscription-based services such as SaaS.
Improved customer satisfaction through system reliability
Stable system operation helps you earn customer trust and increases sales over the long term. In particular, the following factors affect sales:
- Reduced service downtime
- Improved system response performance
- Reduced error rates
- Improved profitability through cost reduction
SRE is not just a methodology for operating systems; it is a strategic approach that contributes directly and indirectly to increasing corporate sales.
If you happen to come across this article in a business role, please support the engineers who are trying to introduce SRE.
Correlation between SRE and sales
I think many people want to correlate the effectiveness of SRE with quantitative figures (sales). I thought the same, but it was quite difficult and I gave up.
Specifically, there is a confounding correlation: high sales mean high server load, which means a worsening SLO, so isolating SRE's contribution to sales is hard.
There are overseas papers suggesting that SRE contributes to sales, but I think it would be quite difficult for us to perform the same analysis as those papers.
So what motivates you to do SRE?
"We can deal with the visibly deteriorating quality of service in advance."is.
It's kind of motivating when you can see the service quality visibly deteriorating.
SRE Benefits Obtained by WINTICKET
First, I would like to talk about the benefits WINTICKET has gained from SRE so far.
Ease of business collaboration
SLI/SLO is very useful if you want to immediately share whether WINTICKET is affected when some external failure occurs.

Potential deterioration in service quality can be visualized
An alert was triggered for the error budget burn rate, and when we looked at the metrics, we saw that the service quality was gradually deteriorating and the error budget was being consumed rapidly.


This is not limited to WINTICKET: this kind of latent deterioration in service quality would go unnoticed without SLI/SLO, and I think catching it is one of the benefits of SRE.
WINTICKET Service Introduction

WINTICKET was launched in 2019 as an internet betting service for publicly run Keirin and Auto Race events. The service's features include the ability to bet while watching race footage and an extensive database of WINTICKET-original data, including AI predictions and EX data.
We also provide functions linked to ABEMA's Keirin and Auto Race channels. WINTICKET became the No. 1 Keirin betting service about two years after its release, and is still growing today.
Self-introduction
The reason I am introducing myself here is because I am not affiliated with WINTICKET.
In this advent calendar, as part of the SRG, I would like to introduce you to my work as an Embedded SRE for another team, WINTICKET.
Embedded SRE
This is an activity (Enabling SRE) in which Site Reliability Engineers instill the culture and knowledge of SRE (Site Reliability Engineering) within development organizations, enabling developers themselves to carry out SRE practices.
Since I belong to the SRG team and work inside the WINTICKET team, I am an Embedded SRE.
There is also the term "Enabling SRE," but I think it is roughly synonymous.
(By the way, when I searched for the term "Enabling SRE" overseas, I got no hits.)
Purpose
- Spreading SRE culture and knowledge to product development teams (culture building)
- Support developers to implement SRE practices voluntarily (culture building)
- Improving Service Reliability (SRE)
The ultimate goal of Enabling SRE is to develop members within each product team who can practice SRE autonomously.
Doing SRE in another department
When I joined as an SRE, rather than just starting to talk about SRE, I did various things to gain the trust of the service team.
Additionally, there is a risk that the service team becomes fatigued before it can enjoy the benefits of SRE, so the aim is to soften this by building up trust first.
When I join a new team, the first things I do are set up monitoring and alerts, reduce toil, and communicate.
SRE and Monitoring
The reason for setting up monitoring and alerting first has to do with SRE.
When your error budget runs out, there is no way to investigate without a proper monitoring environment.
That's why monitoring forms the base of the reliability pyramid diagram you often see.
Personal monitoring and alerting tools
At WINTICKET we use Google Managed Prometheus and Grafana, but I personally find Datadog easier to use.
Before joining WINTICKET, I used Datadog on a service called DotMoney. Its graphical UI and wide range of ways to define SLIs (log-based SLIs, SLIs that combine latency and availability into one) are features not available in Cloud Monitoring.
Although I haven't used them in actual operations, I got the impression that SigNoz, which is open source, and Grafana Cloud, which suits small-scale setups and those who prefer SaaS, are easy to use.
One article states that it cannot be used from an SRE perspective, but a recent update added support for Range Vector Selectors like those in PromQL, so I think it can now be used for SRE purposes.
Alert maintenance
Initially, an alert environment was in place, but it had issues: the alert tools were split between Cloud Monitoring and a self-hosted Alertmanager, with the same alerts defined in both and unnecessary alerts mixed in. In this state, alert maintenance was insufficient and the effectiveness of failure detection suffered.
So we worked on the following two points.
- Unified alert system
- The alert tool was unified with Grafana, which was already being used as a visualization tool.
- Review alert definitions with your team
- List all alerts and align the necessity and importance of the alerts with team members
These efforts have resulted in a simpler, more effective alert structure and improved monitoring operations.
Unified alert system
Since we are already using Grafana, we decided to unify our alert system with Grafana as well.
Cloud Monitoring's alerting system is less flexible than Grafana's.
There were nearly 400 alerts to migrate, and we had to carefully select which ones to migrate.
We deleted alert rules that could be covered by SLI/SLOs, reduced the number of alerts routed to on-call, and so on.
Grafana's label-based alert routing (notification policies) is intuitive: you can achieve this simply by labeling only the alerts you want to page on-call for.
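To illustrate the idea (the labels and receivers below are hypothetical, not WINTICKET's actual configuration), a notification policy behaves like a label match that decides where an alert is delivered:

```python
# Minimal sketch of label-based alert routing, mimicking how a Grafana
# notification policy matches alert labels. Labels and receivers are
# hypothetical, not WINTICKET's actual configuration.

def route_alert(labels: dict) -> str:
    """Return the receiver for an alert based on its labels."""
    # Only alerts explicitly labeled for on-call page the rotation.
    if labels.get("oncall") == "true":
        return "pagerduty-oncall"
    # Everything else goes to a non-paging channel.
    return "slack-alerts"

print(route_alert({"alertname": "SLOFastBurn", "oncall": "true"}))  # pagerduty-oncall
print(route_alert({"alertname": "DiskUsageHigh"}))                  # slack-alerts
```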

In addition to the above, we gained four other benefits.
- WINTICKET also uses AWS, making it easy to monitor using Grafana.
- No more need to port-forward into Alertmanager on GKE
- Grafana can now be used to manage alerts for client teams.
- Added flexibility to alert notifications

Grafana study session
The purpose of this study session was to give the team knowledge about Grafana and make it easy for them to add alerts via Terraform.
That said, an ulterior motive was to raise my visibility as an SRE within the team.
Toil reduction
Reducing toil is a quick way to gain credibility.
For example:
- Completely unmaintained IaC
- A long-running deployment flow
Working on toil reduction forces you to understand the service's infrastructure configuration and deployment flow. If you find toil, great; even if you don't, you still come away understanding the configuration, so it's worth trying.
Cost reduction
I think reducing costs is also a quick way to gain trust.
The advantage over toil reduction is that it can sometimes be easier and more effective, because cost savings are good news for business teams.
Communication
Participate in the daily server team evening meetings, regular events, drinking parties, etc.
It's small, but it adds up.
WINTICKET system configuration
This is an overview of the WINTICKET system configuration.
(In reality, it is multi-regional and quite large-scale.)

Monitoring architecture
The monitoring architecture has been unified into Grafana, so Alertmanager has disappeared and it has become simpler.
Before

After

Implementing SLI/SLO
When it came time to actually implement SLIs/SLOs, I worked with engineers on the team who were interested in them.
You can work alone, but having someone who knows the team's situation well next to you can make the process go more smoothly, so it's best to work with a team if possible.
Not only will it make things go more smoothly, but it will also help each team member gain knowledge about SLI/SLO, making it more effective when it comes to spreading the knowledge throughout the team.
CUJ
WINTICKET was already operating SLI/SLO on its app, so we reflected that on the server side as well.
If you would like to know more about CUJ, please read this article.
Identifying SLIs/SLOs
Once the CUJs are decided, we start identifying SLIs.
We summarize the current latency and error rates in a spreadsheet like the one below and organize them item by item.
In situations like this, having someone on your team makes things go more smoothly.

Adding and Implementing SLOs
Once the identification is complete, we can actually implement the SLOs.
We use Cloud Monitoring's SLO feature and run the SLO dashboard in Grafana.
One caveat first: the Cloud Monitoring dashboard can only display 100 SLOs per service (you can actually register more than 100), which is why we visualize the SLOs with Grafana.
Server metrics are also visualized using Grafana, which makes it easy to use without having to switch between tools.
Keeping SLIs high quality
When operating a service, crawlers and malicious users may access it.
By rejecting such requests up front, you can measure a high-quality SLO.
At WINTICKET, we use Cloud Armor to take various measures to prevent malicious requests from reaching our microservices.
Specifically, we use rate limiting and pre-configured Google Cloud Armor WAF rules to detect and quickly reject malicious requests. This helps us maintain the health of our systems by blocking inappropriate requests, while further improving the reliability of our SLIs/SLOs.
Metrics used for SLI
The metrics used as SLIs at WINTICKET are as follows:
- Prometheus metrics emitted by microservices
- GCP metrics collected by Cloud Monitoring
Availability and Latency SLOs
WINTICKET's SLI/SLO sets latency and availability as independent SLOs.
Personally, I believe this is common practice, but you can also combine latency and availability and measure it as a single SLO.
In fact, DotMoney, which I worked on before WINTICKET as mentioned above, uses combined SLOs.
Cloud Monitoring's SLOs do not allow you to combine multiple metrics like Datadog does.
Which is better?
I think you should also take into account the characteristics of the service, but generally it's easier to keep them separate.
Combined: easier to manage
Separate: easier to investigate, because you can see immediately whether availability or latency degraded
Keeping them separate increases the number of SLOs to manage, but the finer granularity makes SLO reviews considerably easier.
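To make the trade-off concrete, here is a minimal sketch (the request samples and the 300 ms threshold are hypothetical) comparing separate availability/latency SLIs with a single combined SLI:

```python
# Sketch: separate availability and latency SLIs vs. one combined SLI.
# The request samples and the 300 ms threshold are made up.

requests = [
    # (succeeded, latency_ms)
    (True, 120), (True, 480), (False, 90), (True, 210), (True, 350),
]

total = len(requests)
available = sum(1 for ok, _ in requests if ok)
fast = sum(1 for _, ms in requests if ms < 300)
good = sum(1 for ok, ms in requests if ok and ms < 300)

print(f"availability SLI: {available / total:.0%}")  # 80%
print(f"latency SLI:      {fast / total:.0%}")       # 60%
print(f"combined SLI:     {good / total:.0%}")       # 40%
```

With the combined version you track one number, but when it degrades you still have to work out whether availability or latency is the culprit, which is exactly the investigation cost mentioned above.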
Target Window
The Target Window should be determined by your review cadence and development cycle.
The deployment cycle for WINTICKET itself is approximately once a week, but the Target Window is set to 30 days, and SLO review meetings are held with the entire team once every two weeks.
Ideally, if your service is deployed every Wednesday, you could hold a regular SLO review every Thursday with a one-week Target Window, allowing you to discuss changes to SLOs and error budgets caused by feature releases.
However, with frequent reviews the team is less likely to feel the effects of the SLOs, so we deliberately chose a longer span so that latent degradation surfaces and members feel glad they introduced SLOs.
Also, considering how things will behave when the error budget runs out, which I will explain in the next section, I think setting the Target Window to one week may be quite strict.
Error Budget
An error budget is the acceptable loss of reliability, derived from the SLO.

Servers may temporarily violate their SLO due to high database load and the like; the error budget expresses how much SLO violation you can tolerate.
Working with error budgets
Not spending the error budget may earn praise, but you can flip the perspective: are there too few deployments? Is the team avoiding technical challenges?
The error budget is also the "technical challenge" budget allocated against the Target SLO.
If error budget is consistently left over, tighten the SLO so that the budget comes out at just about zero over the target window.
How to calculate the error budget
The error budget is calculated from the SLO targets.
For example, if your SLO is 99.9%, your error budget is 0.1%.
If your target window is 30 days (43,200 minutes), your error budget equates to 43.2 minutes of downtime.
Specific calculation example: if the SLO is 99.9%
- SLO: 0.999
- Target Window: 30 days (43,200 minutes)
- Error budget: (1 − 0.999) × 43,200 = 0.001 × 43,200 = 43.2 minutes
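The same calculation as a small script, using the 99.9% target and 30-day window from the example above:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes for a given SLO over the target window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes

print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
print(round(error_budget_minutes(0.99, 30), 1))   # 432.0
```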
Error Budget Burn Rate
Burn rate is a term coined by Google: a unitless value indicating how fast the error budget is consumed relative to the SLO's target window. For example, with a 30-day target, a constant burn rate of 1 consumes the error budget in exactly 30 days, a constant burn rate of 2 in 15 days, and a burn rate of 3 in 10 days.
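Expressed as code, that definition looks like this (a generic sketch, nothing WINTICKET-specific):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget burns: observed error rate over allowed error rate."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """With a constant burn rate, when does the budget run out?"""
    return window_days / rate

for r in (1, 2, 3):
    print(f"burn rate {r}: budget exhausted in {days_to_exhaustion(r):g} days")
# burn rate 1: 30 days, burn rate 2: 15 days, burn rate 3: 10 days
```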
Datadog's documentation on burn rate is very clear.
At WINTICKET, we set burn-rate alerts per SLO with 5-minute and 1-hour time windows.
Below, these are referred to as the fast burn rate and the slow burn rate, respectively.
fast burn rate
The purpose is to detect cases where a bug ships with a release and the changed code degrades service quality.
Although it is not yet defined in the flow within WINTICKET, we would like to make it possible to use it as a criterion for deciding whether to revert after a canary release.
slow burn rate
This is not a sudden failure like a fast burn rate, but an important warning sign that indicates a chronic deterioration in the system's service quality.
While no immediate action is required, it is an indicator that you should continue to be aware of.
Slow only for latency, fast and slow for availability
For burn-rate alerts on availability SLOs, we set both fast and slow alerts as described above, but for latency we set only the slow one.


One of the reasons we don't apply a fast burn rate to latency is the issue of noisy alerts.
If the endpoint being measured depends on an external API, its latency is affected by that API's latency, so we do not apply a fast burn rate there; the slow burn rate gives good results.
On the other hand, availability is set up to take advantage of both fast and slow characteristics.
Burn rate alert at midnight
With a short time window, simple burn rate alerts often fire at times when there are fewer requests, such as late at night.
For example, if your system receives 10 requests per hour, one failed request yields a 10% error rate for that hour. With a 99.9% SLO, that corresponds to a 100x burn rate and consumes 13.9% of your 30-day error budget in a single hour, so an alert fires immediately.
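Working through those numbers:

```python
# Low-traffic example: 10 requests in an hour, 1 failure, 99.9% SLO, 30-day window.
error_rate = 1 / 10                    # 10% error rate for that hour
slo = 0.999
rate = error_rate / (1 - slo)          # burn rate = 0.10 / 0.001 = 100
hours_in_window = 30 * 24              # 720 hours
consumed = rate * 1 / hours_in_window  # budget fraction burned in one hour
print(round(rate))                     # 100
print(f"{consumed:.1%}")               # 13.9%
```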

WINTICKET mitigates this somewhat by using burn rate thresholds of 3 or 20.
This is a way to reduce the impact of error requests, even during times of low request volume.
Many other approaches are presented here; some seem a little unrealistic, but I find them interesting.
If you want a more accurate burn rate
Combining multiple windows and multiple burn rates can help eliminate false positives.
In this example, the alert fires when the burn rate exceeds 14.4 over both the 5-minute and 1-hour windows (a burn rate of 14.4 sustained for an hour consumes 2% of a 30-day error budget).
(Quoted from "Alerting on SLOs".)
This sacrifices some immediacy, but it lets you configure alerts that capture genuine degradation in service quality while eliminating noisy alerts for conditions that quickly recover.
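A sketch of that multiwindow condition (the error-rate inputs would come from your metrics backend; here they are just example values):

```python
# Multiwindow, multi-burn-rate alerting sketch, after "Alerting on SLOs":
# page only when BOTH the short and long windows burn above the threshold.

SLO = 0.999

def burn_rate(error_rate: float) -> float:
    return error_rate / (1 - SLO)

def should_page(err_5m: float, err_1h: float, threshold: float = 14.4) -> bool:
    """Fire only if both windows exceed the burn-rate threshold."""
    return burn_rate(err_5m) > threshold and burn_rate(err_1h) > threshold

# A spike that has already recovered: the 5m window is clean, so no page.
print(should_page(err_5m=0.001, err_1h=0.05))  # False
# A sustained problem: both windows are burning fast, so page.
print(should_page(err_5m=0.03, err_1h=0.02))   # True
```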
When the error budget is exhausted
When I previously participated in DotMoney as an Embedded SRE, I discussed this with the business manager and we agreed that "when the error budget is exhausted, feature releases are prohibited, except for production-outage fixes, fixes that restore reliability, and releases involving external companies."
At WINTICKET, when the error budget runs out, the server team has formed its own culture: we determine whether the cause is external, and if it is not, we cut tasks and assign members to restore the error budget.
The spread of SLI/SLO
Now we have reached the point where we can actually code and visualize the SLI/SLOs.
We create Grafana dashboards for each component.

What to visualize
What is the purpose of visualization?
The following items are required when reviewing SLOs:
- Error Budget
- Current SLO

But this alone is not enough: when the error budget is depleted, you need more information to dig into why.
- Error budget time-series graph
- SLI time-series graph
- Latency time series for the relevant SLI (latency SLOs)
- Response-code time series for the relevant SLI (availability SLOs)

By deploying this in the same dashboard, you can enjoy the following benefits when reviewing:
- You can determine when the condition worsened or improved
- If it coincides with a release, that is likely the cause
- Or an external service failure
- You can determine whether the condition is continuing to deteriorate
- If the SLO has deteriorated, you can verify whether your countermeasures are improving it
- If there are no other issues and the SLI/SLO you set is simply too strict, you can decide to adjust it
Conducting study sessions
We held study sessions with the whole team.
The purpose was to help people understand SLI/SLO at least a little, though it is impossible to fully grasp it in a single session.
I didn't get it from one session either, so we shared things like "this is what we're going to do!" and "these are the benefits!" to create a rallying-call kind of atmosphere.

Review SLOs with the whole team every two weeks
The purpose of reviewing SLOs with the entire team is to:
- Creating a culture of SLI/SLO through team-wide efforts
- (When service quality actually deteriorates) Helping members realize the benefits of SLI/SLO and stay motivated
How to do reviews on WINTICKET
At the time of writing this article, WINTICKET has 103 SLOs.
It would be exhausting to review all of this.
Therefore, the review looks at "only the SLOs where the error budget has been depleted + what happened to the SLOs where the error budget was depleted at the time of the previous review."
This allows you to start out with a simple operation that won't tire you out.
Ultimately, we aim to reach a point where we have an agreement with the business side on how to act when the error budget is depleted, and where someone on the server team takes action to restore the budget whenever it runs out.
What about SLOs that don't consume any error budget at all?
At WINTICKET, once every three to six months we identify SLOs whose error budget hovers around 100%, and tighten those SLIs and SLOs so that the error budget is not excessively generous.
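As a rough sketch of how such candidates could be flagged (the data shape and SLO names are hypothetical; at WINTICKET this is a periodic manual review):

```python
# Flag SLOs whose error budget has stayed near 100% remaining across
# reviews, i.e. candidates for tightening. Names and data are made up.

history = {
    "bet-api availability": [0.99, 1.00, 0.98, 1.00],  # budget remaining per review
    "race-feed latency":    [0.40, 0.25, 0.55, 0.30],
}

def tightening_candidates(history: dict, floor: float = 0.95) -> list:
    """SLOs whose remaining budget never dropped below `floor`."""
    return [name for name, remaining in history.items()
            if min(remaining) >= floor]

print(tightening_candidates(history))  # ['bet-api availability']
```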
We use Wrike for task management at WINTICKET, so we also use it for SLO retrospectives.
We try to keep the number of tools to a minimum and aim to operate in a way that doesn't tire our team members out.

The actual review process is as follows:
- Randomly select a facilitator
- Split into teams, one per category
- Each team views the SLOs for its assigned categories in Grafana
- File a ticket in Wrike for each exhausted error budget
- Decide whether a response is needed
- If the problem lies with an external API, no action is required
- Check the status of SLOs already filed in Wrike (error budget exhausted)
- Comment and update the error budget column
The quality of service is actually getting worse day by day.
As mentioned earlier, the quality of service is getting worse day by day.
For example, latency increases with the number of DB records.
Or a table added for a new feature turns out to have an index problem: even if there is no issue at first, the impact becomes obvious as records accumulate.

Why has SLI/SLO become so widespread?
Looking at the current state of WINTICKET adoption, it appears that people other than the members who promoted SLI/SLO have started to look for the causes of the deterioration using burn rate alerts and the SLO dashboard, and that the server team is independently implementing SLI/SLO.

In fact, I was the facilitator for the first review meeting, but now someone from the server team is taking over that role.
We've come this far largely because @taba2424, who worked on this with me, is so capable; that was reassuring.
In addition, @akihisasen, the head of development at WINTICKET, had been interested in SLI/SLO from the beginning, understood its benefits, and created a structure that made it easy to move forward; I think that was also a big factor.
Conclusion: The future and ideals of SLI/SLO
Currently, SLI/SLO has become a common language within the server team, but it has not yet achieved its original goal of becoming a "common language with the business."
The next step for the SRE team is to incorporate SRE into business roles. To achieve this, we plan to take the following approaches:
- Improve coverage
- Add new SLOs
- Increase confidence in the SLOs (e.g., adjust the SLIs)
- Involve someone from the business team in the server team's SLO review meetings
We have also created a business dashboard that shows the current SLOs so that business people can easily understand them.
We will utilize these to gradually increase penetration.

The need for business penetration and a culture where those who improve SLOs are praised
I believe that anyone who improves service quality should be praised and recognized not just by engineers, but by everyone involved with the product.
However, WINTICKET currently has no agreement with the business side.
First of all, we are currently working on improving the reliability and comprehensiveness of SLOs while operating and promoting SLOs across the entire server team.
Chat Embedded SRE? Enabling SRE?
I've been hearing the term "Enabling SRE" a lot recently, so I looked into it, but I've never seen it overseas.
There is a concept of an "enabling team" in the book "Team Topologies: Organizing Business and Technology Teams for Fast Flow."
SRG is looking for people to work with us. If you are interested, please contact us here.