A story about how I reaffirmed the importance of SRE through the large-scale betting service "WINTICKET".
I'm Hasegawa (@rarirureluis) from the Service Reliability Group (SRG) of the Media Management Division.
#SRG (Service Reliability Group) primarily provides cross-cutting support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article describes the benefits I gained from participating in WINTICKET as an Embedded SRE.
SRE x Sales
For those unfamiliar with SRE, let me first explain the benefits of doing SRE.
The implementation of SRE contributes directly and indirectly to increased sales. Specifically, it has a positive impact on sales in the following ways:
Improving Customer Lifetime Value (LTV)
SRE enhances customer satisfaction by improving service stability. In particular, it can significantly improve LTV (Customer Lifetime Value) in subscription-based services like SaaS.
Improving customer satisfaction through system reliability
Stable system operation builds customer trust and leads to long-term sales growth. In particular, the following factors influence sales:
- Reducing service downtime
- Improving system response performance
- Reducing the error rate
- Improving profitability through cost reduction
SRE is not just a system operation method; it is a strategic approach that directly and indirectly contributes to increasing a company's sales.
If you happen to come across this article, please be supportive of any engineers who are considering implementing SRE.
Correlation between SRE and sales
Many people who implement SRE (Site Reliability Engineering) want to correlate its effects with quantitative metrics (sales). I thought so too, but it was quite difficult, and I gave up.
Specifically, this is because sales and reliability are confounded: high sales mean high server load, and high server load worsens SLOs (Service Level Objectives).
There are research papers from overseas suggesting that SRE contributes to sales.
I think it would be quite difficult to perform the same analysis as in this paper.
So, what motivates me to do SRE?
It is that "we can proactively address service quality once its deterioration has been made visible."
Seeing deterioration in service quality become visible is, oddly enough, motivating.
SRE benefits obtained through WINTICKET
First, let me share what I've found beneficial about working as an SRE at WINTICKET at this stage.
Ease of business integration
If you want to immediately share whether an external failure will affect WINTICKET, SLI/SLO is very useful.

It is possible to visualize the potentially deteriorating service quality.
An alert for error budget burn rate was triggered, and upon reviewing the metrics, we found that service quality was gradually deteriorating and the error budget was being consumed rapidly.


This is not specific to WINTICKET: latent deterioration in service quality like this is something you would not notice without implementing SLI/SLO, and I think being able to spot it is one of the benefits of doing SRE.
WINTICKET Service Introduction

WINTICKET was launched in 2019 as an internet betting service for publicly operated sports such as bicycle racing (keirin) and auto racing. Key features of the service include the ability to bet while watching race videos, and its extensive database of original WINTICKET data, including AI predictions and EX data.
In addition, we offer features that link with ABEMA's Keirin and Auto Race channels. WINTICKET became the No. 1 Keirin betting service within approximately two years of its release and continues to grow.
Self-introduction
The reason I'm introducing myself here is because I'm not affiliated with WINTICKET.
For this Advent Calendar, as a member of SRG, I'd like to introduce my work as an Embedded SRE at WINTICKET, which is a different team.
Embedded SRE
Embedded SRE is an effort in which Site Reliability Engineers (SREs) instill the culture and knowledge of Site Reliability Engineering within development organizations, enabling developers themselves to practice SRE (Enabling SRE).
I'm on the SRG team, so I'm an Embedded SRE.
There's also the term "Enabling SRE," which I think is pretty much synonymous.
(Incidentally, searching for "Enabling SRE" overseas didn't yield any results.)
Purpose
- Spreading SRE culture and knowledge within the product development team (culture building)
- Supporting developers in proactively implementing SRE practices (fostering a culture of SRE adoption)
- Improving service reliability (SRE)
The ultimate goal of Enabling SRE is to cultivate members within each product team who can autonomously practice SRE.
To take on SRE duties in a different department
When I joined as an SRE, instead of immediately talking about SRE, I did various things to gain the trust of the service team.
Furthermore, since the service team may become exhausted before they can reap the benefits of SRE, the aim is to mitigate this to some extent by building up their trust in me beforehand.
After joining a new team, the tasks I take on include setting up monitoring and alerting, reducing toil, and communicating with the team.
SRE and Monitoring
The reason we start by setting up monitoring and alerting is related to SRE (Site Reliability Engineering).
Without a proper monitoring environment, it's impossible to investigate when the error budget is depleted.
That's why Monitoring forms the base in the pyramid diagrams we often see.
Personal monitoring and alerting tools
WINTICKET uses Google Managed Prometheus and Grafana, but personally, I found Datadog easier to use.
Before joining WINTICKET, I used Datadog at a service called DotMoney. Its graphical UI and rich SLI definition methods (log-based SLIs, and SLIs that treat latency and availability together) are features that Cloud Monitoring lacks.
Although I haven't used them in actual production, SigNoz, which is open source, and Grafana Cloud, which is suitable for small-scale projects or as a SaaS solution, both seemed easy to use.
The article states that it's unusable from an SRE perspective, but recent updates have added support for range vector selectors similar to those in PromQL, so I think it could also be used for SRE purposes.
Establishing alerts
Initially, the alerting environment was in place, but there were several issues. The alerting tools were split between Cloud Monitoring and Self-hosted Alertmanager, resulting in the same alerts being defined in both, and the inclusion of unnecessary alerts. In this state, alert maintenance was insufficient, and the effectiveness of fault detection was reduced.
Therefore, we addressed the following two points.
- Unification of alert systems
  - We unified our alerting tools to Grafana, which we were already using as our visualization tool.
- Review of alert definitions with the team
  - We listed all alerts and discussed the necessity and importance of each one with team members.
These efforts have resulted in a simpler and more effective alert structure, improving monitoring operations.
Standardization of alert systems
Since we're already using Grafana, we decided to unify our alert system to Grafana as well.
Cloud Monitoring's alerting system is less flexible than Grafana's.
There were nearly 400 alerts to migrate, so we triaged them and migrated them one by one.
We removed alert rules that can be covered by SLI/SLO and reduced the number of alerts that require an on-call response.
Grafana's label-based alert routing (Notification Policies) is intuitive: you simply attach a label to only the alerts that should go to the on-call engineer.
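As a rough illustration of the idea, label-based routing can be modeled as matching an alert's labels against a policy. This is only a toy model and not Grafana's actual implementation or configuration format; the label key `oncall`, the receiver names, and the alert names are all hypothetical.

```python
# Toy model of label-based alert routing. This is NOT Grafana's API or
# configuration format; the label key "oncall" and the receiver names
# are hypothetical and chosen only for illustration.

def route(alert_labels: dict) -> str:
    """Return the receiver for an alert based on its labels."""
    # Only alerts explicitly labeled oncall="true" go to the on-call engineer.
    if alert_labels.get("oncall") == "true":
        return "pager"
    # Everything else goes to a chat channel for later triage.
    return "chat"


alerts = [
    {"alertname": "HighErrorBudgetBurn", "oncall": "true"},
    {"alertname": "DiskUsageWarning"},
]

for labels in alerts:
    print(labels["alertname"], "->", route(labels))
```

The point of the design is that the routing rule stays fixed while the decision "should this page someone?" lives on each alert as a label.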

In addition to the above, we were able to gain four other benefits.
- Since WINTICKET also uses AWS, monitoring with Grafana has become much easier.
- I no longer have to manually set up port forwarding for Alertmanager (GKE).
- Client team alerts can now also be managed with Grafana.
- The alert notification content has become more flexible.

We held a Grafana study session.
The purpose of this study session is to share knowledge about Grafana within the team and to enable team members to easily add alerts from Terraform.
That being said, one of my ulterior motives was to increase my visibility within the team as an SRE.
Toil reduction
Reducing toil is a quick way to gain trust.
Examples of toil include:
- Completely unmaintained IaC
- A time-consuming deployment flow
Working on toil reduction also lets you understand the service's infrastructure configuration and deployment flow. If you find toil, that is a bonus; even if you don't, you still come away understanding the configuration, so there is no downside to doing it.
Cost reduction
I think cost reduction is a quick way to gain trust.
Compared with toil reduction, it can be easier in some cases, and because cost savings are also welcome news for business teams, it is even more effective.
Communication
I participate in the server team's daily evening meetings and regular meetings, and I also join them when they go out for drinks.
It may be small, but it adds up.
WINTICKET System Configuration
Here is a simplified system configuration diagram for WINTICKET.
(In reality, it's multi-region and quite large-scale.)

Monitoring architecture
We consolidated our monitoring architecture on Grafana, eliminating Alertmanager and making it simpler.
Before

After

SLI/SLO implementation
When it comes time to actually implement SLI/SLO, we'll proceed with engineers from within the team who are interested in SLI/SLO.
While you can proceed alone, having someone who is familiar with the team's situation beside you can make things go more smoothly, so it's best to work with the team if possible.
Not only does it ensure a smooth process, but it also allows team members to accumulate knowledge about SLI/SLO, making it even more effective in the phase of disseminating it throughout the entire team.
CUJ
Since WINTICKET was already using SLI/SLO on the application side, we simply carried the same definitions over to the server side as well.
For more information about CUJ (Critical User Journey), please see this article.
Identifying SLI/SLO
Once the CUJ is determined, we will identify the SLIs.
We will use a spreadsheet like the one below to summarize and organize the current latency and error rates.
In situations like this, having team members present makes things go much more smoothly.

Addition and implementation of SLOs
Once the identification process is complete, we will proceed with actually implementing the SLOs.
We will use Cloud Monitoring's SLO (Service Level Objective) feature, while managing the SLO dashboard with Grafana.
One caveat up front: the Cloud Monitoring dashboard can only display 100 SLOs per service (although you can actually register more than 100), so we use Grafana to visualize the SLOs.
We also visualize server metrics using Grafana, so there's no need to switch between tools, which makes things much easier.
Keep SLIs high quality
When running a service, you can expect it to be accessed by crawlers, malicious users, and other third parties.
By rejecting such requests in advance, you can measure a high-quality SLO.
At WINTICKET, we utilize Cloud Armor to implement various measures to prevent malicious requests from reaching our microservices.
Specifically, we use rate limiting and pre-configured Google Cloud Armor WAF rules to detect and promptly reject malicious or invalid requests. By blocking inappropriate requests, we maintain system health while further improving the reliability of SLI/SLO.
Metrics used for SLI
The metrics used as SLIs in WINTICKET are as follows:
- Prometheus metrics output by microservices
- GCP metrics collected by Cloud Monitoring
Availability and Latency SLO
WINTICKET's SLI/SLO uses latency and availability as separate SLOs.
While I personally believe this is the common practice, latency and availability can also be combined and measured as a single SLO.
In fact, DotMoney, which I worked on before WINTICKET as mentioned earlier, uses combined SLOs.
Cloud Monitoring's SLOs do not allow you to combine multiple metrics like Datadog does.
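To make the difference concrete, here is a minimal sketch of separate and combined SLIs computed over the same set of requests. The `Request` type, the 500 ms threshold, and the definition of "good" (non-5xx status, latency within the threshold) are assumptions for illustration, not WINTICKET's or DotMoney's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # server-side latency in milliseconds


# All functions assume a non-empty list of requests.

def availability_sli(requests: list) -> float:
    """Share of requests that did not fail with a 5xx status."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list, threshold_ms: float) -> float:
    """Share of requests answered within the latency threshold."""
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)


def combined_sli(requests: list, threshold_ms: float) -> float:
    """Share of requests that were both successful and fast enough."""
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= threshold_ms
    )
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 900.0),  # successful but slow
    Request(500, 80.0),   # fast but failed
    Request(200, 150.0),
]

print(availability_sli(sample))     # 0.75
print(latency_sli(sample, 500.0))   # 0.75
print(combined_sli(sample, 500.0))  # 0.5
```

With the combined SLI, a drop tells you that something is wrong but not whether the cause is errors or slowness; the separate SLIs answer that directly.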
Which is better?
I think we should also take the characteristics of the service into consideration, but generally, it's easier to keep them separate.
If they are combined: there are fewer SLOs, so they are easy to manage.
If they are separate: investigation is easier because availability and latency can be looked at individually.
Keeping them separate does increase the number of SLOs to manage, but because the granularity is finer, reviewing SLOs becomes quite easy.
Target Window
The Target Window is determined by the frequency of regular meetings and the development cycle.
While WINTICKET itself is deployed approximately once a week, the Target Window is set to 30 days, and we hold SLO review meetings for the entire team every two weeks.
Ideally, if service deployments occur every Wednesday, we could schedule regular SLO meetings every Thursday and set the Target Window to one week, allowing us to discuss changes in SLOs and error budgets due to feature releases.
However, if reviews are too frequent, team members are less likely to feel the effects of SLOs, so we intentionally use a longer period so that latent degradation becomes visible and the team feels that implementing SLOs was worthwhile.
Also, considering what happens when the error budget runs out, which is explained later, I think setting the Target Window to one week would be quite difficult.
Error budget
The error budget is the acceptable loss of reliability, derived from the SLO.

Servers can temporarily violate SLOs due to high database load, etc.
The error budget expresses how much of that violation can be tolerated.
Error Budget Management
While it might be commendable that the error budget hasn't been consumed, this can be rephrased as:
"Are they deploying infrequently?" or "Are they not undertaking any technical challenges?"
The error budget is also the budget allocated for "technical challenges" against the Target SLO.
If error budget is left over, tighten the SLO (Service Level Objective) so that the remaining error budget comes as close as possible to exactly 0% at the end of the Target Window.
How to calculate the error budget
The error budget is calculated from the SLO target value.
For example, if the SLO is 99.9%, the error budget will be 0.1%.
If the target window is 30 days (43,200 minutes), the error budget will correspond to 43.2 minutes of downtime.
Specific calculation example: if the SLO is 99.9%
- SLO: 0.999
- Target Window: 30 days (43,200 minutes)
Error budget = (1 − 0.999) × 43,200 minutes = 0.001 × 43,200 minutes = 43.2 minutes
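The same calculation can be written as a small function. This is a minimal sketch; the function name is hypothetical, and it expresses the budget as minutes of full downtime, as in the example above.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Error budget as minutes of full downtime within the target window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


# 99.9% SLO over a 30-day window (43,200 minutes).
budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))  # 43.2
```

Rounding is used only because of floating-point representation; the exact value is 43.2 minutes.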
Error budget burn rate
Burn rate is a term coined by Google: a unitless value indicating how quickly the error budget is consumed relative to the target window of an SLO (Service Level Objective). For example, if the target window is 30 days, a burn rate of 1 means that, at a constant rate, the error budget will be completely consumed in exactly 30 days. A burn rate of 2 means that, at a constant rate, the error budget will be depleted in 15 days, and a burn rate of 3 means that it will be depleted in 10 days.
The Datadog documentation provides a clear explanation of burn rate.
WINTICKET sets up alerts for burn rate with 5-minute and 1-hour time windows for each SLO.
From now on, we will refer to these as the fast burn rate and the slow burn rate, respectively.
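The relationship between error rate, burn rate, and time to exhaustion can be sketched as follows. The function names are hypothetical, and the sketch assumes a constant error rate over the whole window.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days until the error budget is gone at a constant burn rate."""
    return window_days / rate


print(days_until_exhausted(1))  # 30.0
print(days_until_exhausted(2))  # 15.0
print(days_until_exhausted(3))  # 10.0

# A 0.2% error rate against a 99.9% SLO burns the budget about twice
# as fast as allowed.
print(round(burn_rate(0.002, 0.999), 2))  # 2.0
```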
fast burn rate
The purpose of this is to detect cases where a release introduced a bug or the changed code degraded the quality of the service.
Although it's not yet defined in the WINTICKET flow, we want to make it usable as a criterion for deciding to roll back after a canary release.
slow burn rate
This is an important warning signal indicating a chronic deterioration in the quality of service of the system, rather than a sudden failure like a fast burn rate.
While immediate action isn't necessary, this is an indicator that should be continuously monitored.
Latency uses slow only; availability depends on the nature of the application
For burn rate alerts related to availability SLOs, we have configured alerts for both fast and slow as mentioned above, but for latency, we have configured only slow alerts.


One reason for not applying the fast burn rate to latency is that it easily becomes alert noise.
For example, if the endpoint being measured depends on an external API, its latency is affected by the latency of that external API.
Latency degradation can still be detected with the slow burn rate alone.
For availability, on the other hand, we configure both alerts to take advantage of the characteristics of each.
Burn rate alert for late night
Simple burn rate alerts, with their short time window, often fire during off-peak hours such as late at night when requests are few.
For example, if a system receives 10 requests per hour, one failed request represents a 10% error rate for that hour. With a 99.9% SLO, this single request corresponds to a 100x burn rate (10% / 0.1%) and consumes 13.9% of the 30-day error budget, triggering an alert immediately.
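To see why a single failure is so costly at low traffic, the share of the 30-day budget it consumes can be sketched as below. The function name is hypothetical, and the sketch assumes the request rate stays constant over the whole window.

```python
def budget_share_of_failures(
    failures: int,
    requests_per_hour: float,
    slo: float,
    window_days: int = 30,
) -> float:
    """Fraction of the window's error budget consumed by the failures."""
    total_requests = requests_per_hour * 24 * window_days
    allowed_failures = total_requests * (1 - slo)
    return failures / allowed_failures


# 10 requests/hour: 7,200 requests in 30 days, so only 7.2 failures allowed.
low_traffic = budget_share_of_failures(1, 10, 0.999)
print(f"{low_traffic:.1%}")  # 13.9%

# 10,000 requests/hour: the same single failure is negligible.
high_traffic = budget_share_of_failures(1, 10_000, 0.999)
print(f"{high_traffic:.4%}")  # 0.0139%
```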

WINTICKET mitigates this somewhat by setting the burn rate threshold to 3 or 20.
As another approach, sre.google describes a method that involves artificially generating successful requests.
This method helps mitigate the impact of error requests, even during periods with fewer requests.
It seems a bit unrealistic, but I think it's interesting.
There are many other approaches introduced, so please take a look.
If you want a more precise burn rate
By combining multiple windows and multiple burn rates, false positives can be eliminated.
In this example, the alert fires only when the burn rate reaches 14.4 over both the 5-minute and 1-hour windows (a burn rate of 14.4 sustained for 1 hour consumes 2% of a 30-day error budget).
Quoted from "Alerting on SLOs".
While this eliminates the benefit of immediacy, it allows you to create essential alerts that indicate a deterioration in service quality, excluding noisy issues that will quickly recover.
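The condition described above can be sketched as a simple predicate. The function names are hypothetical, the inputs are assumed to be burn rates already computed over each window, and whether the comparison is inclusive is an assumption.

```python
def should_alert(
    burn_rate_5m: float,
    burn_rate_1h: float,
    threshold: float = 14.4,
) -> bool:
    """Fire only when BOTH the short and the long window exceed the threshold."""
    return burn_rate_5m >= threshold and burn_rate_1h >= threshold


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed at a constant burn rate."""
    return rate * hours / (window_days * 24)


# A short spike that has already recovered: the 1-hour window stays low.
print(should_alert(burn_rate_5m=40.0, burn_rate_1h=3.0))   # False

# Sustained degradation: both windows are above the threshold.
print(should_alert(burn_rate_5m=20.0, burn_rate_1h=15.0))  # True

# Where 14.4 comes from: one hour at 14.4 consumes 2% of a 30-day budget.
print(round(budget_consumed(14.4, 1), 4))  # 0.02
```

The long window confirms that the problem is significant, and the short window confirms that it is still happening, which is what removes the quickly recovering noise.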
When the error budget is depleted
At DotMoney, where I previously participated as an Embedded SRE, I talked with the business manager and we were able to reach the following agreement: "Except for responding to production failures, making improvements to restore reliability, and releasing features involving external companies, feature releases are prohibited once the error budget is exhausted."
At WINTICKET, when the error budget runs out, we have developed a unique culture within the server team where we determine whether the cause is external, and if not, we create a task and assign members to restore the error budget.
SLI/SLO adoption
Now we've reached the point where we can actually code and visualize SLI/SLO.
We are creating a Grafana dashboard for each component.

Content to be visualized
What is the purpose of visualization?
The following items are necessary when reviewing SLOs:
- Error budget
- Current SLO

However, this alone is not enough.
When an error budget is depleted, more detailed information is needed to dig deeper into the problem.
- Time series graph of the error budget
- Time series graph of the SLI
- Time series graph of latency for the relevant SLI
- Time series graph of response codes (availability) for the relevant SLI

By placing these on the same dashboard, you get the following benefits when reviewing:
- You can determine when things got worse or better.
  - If the timing coincides with a release, the release is the cause.
  - Alternatively, the cause may be an external service outage.
- You can determine whether things are continuously getting worse.
- If you took measures against a worsened SLO (Service Level Objective), you can see whether it is showing signs of improvement.
- If nothing else is wrong and the configured SLI/SLO is simply too strict, you can adjust it.
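As a minimal sketch of what the error budget time series represents, the remaining budget at each point can be derived from cumulative good and total request counts within the Target Window. The function name and the sample counts are hypothetical; in practice these values come from the monitoring backend.

```python
def remaining_error_budget(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still remaining (negative if overspent)."""
    allowed_bad = total * (1 - slo)
    if allowed_bad == 0:
        # No traffic yet, so nothing has been consumed.
        return 1.0
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


# Hypothetical cumulative (good, total) counts sampled over the window.
snapshots = [
    (10_000, 10_000),
    (49_980, 50_000),
    (99_950, 100_000),
    (149_820, 150_000),
]

series = [remaining_error_budget(g, t, slo=0.999) for g, t in snapshots]
print([round(v, 2) for v in series])  # [1.0, 0.6, 0.5, -0.2]
```

A curve like this shows at a glance when consumption accelerated, which is the moment to compare against releases and external incidents.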
Implementation of a study session
We will hold a study session for the entire team.
The goal is to help people understand SLI/SLO even a little, but it's impossible to understand it all in just one study session.
I didn't understand it all at first either, so I kept it to things like "This is what we're going to do!" and "These are the benefits!" and ran it in a kind of pep-rally style.

We review the SLOs (Service Level Objectives) as a team every other week.
The purpose of reviewing the SLOs with the entire team is as follows:
- By working together as a team, we can cultivate a culture of SLI/SLO.
- (If service quality actually deteriorates) we want members to experience the benefits of SLI/SLO first-hand and stay motivated.
How to do reviews on WINTICKET
As of writing this article, WINTICKET has 103 SLOs (Service Level Objectives).
Reviewing all of them would be quite exhausting.
Therefore, the review covers only SLOs whose error budget is currently depleted, plus a follow-up on the SLOs whose error budget was depleted at the previous review.
This keeps the operation simple and low-burden from the start.
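The selection rule above can be sketched as a small filter. The function name, the SLO names, and the input shapes are hypothetical; this is not the actual WINTICKET tooling.

```python
def slos_to_review(remaining_budget: dict, depleted_last_review: set) -> list:
    """SLOs depleted now, plus those that were depleted at the previous review."""
    depleted_now = {
        name for name, remaining in remaining_budget.items()
        if remaining <= 0.0
    }
    return sorted(depleted_now | set(depleted_last_review))


# Remaining error budget per SLO as a fraction (negative means overspent).
remaining = {
    "purchase-availability": -0.12,
    "purchase-latency": 0.40,
    "login-availability": 0.85,
}
previous = {"login-availability"}

print(slos_to_review(remaining, previous))
# ['login-availability', 'purchase-availability']
```

Here `login-availability` has recovered but is still reviewed once more as a follow-up, while `purchase-latency` is skipped entirely.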
Ultimately, the goal is to reach an agreement with the business side on how to handle a depleted error budget, so that when an error budget is depleted, someone on the server team works on restoring it.
What should be done about SLOs where the error budget is not being consumed at all?
At WINTICKET, every three or six months we identify SLIs/SLOs whose remaining error budget stays close to 100% and tighten them, so that error budgets do not remain excessive.
Since WINTICKET uses Wrike for task management, we also use Wrike for reviewing our SLOs (Service Level Objectives).
We strive to minimize the number of tools we use and to operate in a way that doesn't overwhelm our team members.

The actual review process is as follows:
- A facilitator is chosen at random.
- Members split into teams, one per category.
- Each team checks the SLOs for its assigned category in Grafana.
- If an error budget is depleted, a Wrike task is created.
  - Decide whether action is needed.
  - If the cause is an external API failure, no action is required.
- Check the status of the "error budget exhausted" tasks that already exist in Wrike.
  - Leave a comment and update the error budget field.
In reality, service quality is deteriorating day by day.
As mentioned earlier, service quality deteriorates day by day.
For example, latency can increase as the number of database records grows.
Or a table added as part of a new initiative may have a problem with its indexes.
Even if there are no problems at first, the impact becomes apparent as the number of records increases.

Why have we been able to implement SLI/SLO to this extent?
Looking at the current state of WINTICKET adoption, it seems that members other than those who initially promoted SLI/SLO are now starting to investigate the causes of deterioration through burn rate alerts and the SLO dashboard, and the server team is taking the initiative to manage SLI/SLO on their own.

In fact, I was the facilitator for the first review meeting, but now someone from the server team is doing it.
The fact that we've come this far is thanks to everyone working together. @taba2424's exceptional abilities were also very reassuring.
In addition, I think a major factor was that @akihisasen, the development manager of WINTICKET, was interested in SLI/SLO from the beginning, knew its benefits, and created a structure that made it easy to move forward with it.
In conclusion: The future and ideals of SLI/SLO
Currently, SLI/SLO remains a common language within the server team, and has not yet achieved its original goal of becoming a common language with the business side.
The SRE team's next task is to integrate it into business roles. To achieve this, we plan to take the following approach:
- Improve coverage
  - Add new SLOs
- Improve SLO reliability (e.g., adjust SLIs)
- Have someone from the business side participate in the server team's SLO review meeting
We have also created a business-oriented dashboard that displays the current SLOs (Service Level Objectives) in a way that is easy for business people to understand.
We will utilize these methods to gradually increase their adoption.

The need for business integration and a culture where those who improve SLOs are celebrated.
I believe that those who improve service quality should be praised and appreciated not only by engineers, but by everyone involved with the product.
However, WINTICKET currently does not have an agreement with the business side.
First, we are working to implement and integrate SLOs throughout the entire server team, while also improving their reliability and comprehensiveness.
Aside: Embedded SRE? Enabling SRE?
I've been hearing the term "Enabling SRE" a lot lately, but when I looked into it, I couldn't find it used anywhere outside Japan.
The book "Team Topologies" (published in Japanese as "Team Topology: Adaptive Organizational Design for Rapidly Delivering Valuable Software") apparently contains a description of "Enabling."
SRG is looking for new team members.
If you are interested, please contact us here.





