A story about how I reaffirmed the importance of SRE through the large-scale betting service "WINTICKET".

I'm Hasegawa (@rarirureluis) from the Service Reliability Group (SRG) of the Media Management Division.
#SRG (Service Reliability Group) primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article describes the benefits I gained from participating in WINTICKET as an Embedded SRE.
This article is the day 6 entry of the CyberAgent Group SRE Advent Calendar 2024.
 

SRE x Sales


For those unfamiliar with SRE, let me first explain the benefits of doing SRE.
Implementing SRE contributes directly and indirectly to increased sales. Specifically, it has a positive impact on sales in the following ways:

Improving Customer Lifetime Value (LTV)


SRE enhances customer satisfaction by improving service stability. In particular, it can significantly improve LTV (Customer Lifetime Value) in subscription-based services like SaaS.

Improving customer satisfaction through system reliability


Stable system operation builds customer trust and leads to long-term sales growth. In particular, the following factors influence sales:
  • Reduced service downtime
  • Improved system response performance
  • Reduced error rate
  • Improved profitability through cost reduction
SRE is not just a way of operating systems; it is a strategic approach that directly and indirectly contributes to increasing a company's sales.
If you happen to come across this article, please be supportive of any engineers who are considering implementing SRE.

Correlation between SRE and sales


Many people who implement SRE (Site Reliability Engineering) want to correlate its effects with quantitative metrics (sales). I thought so too, but it was quite difficult, and I gave up.
Specifically, this is because of the chain "high sales = high server load = worsening SLO (Service Level Objective)": when sales go up, the SLO tends to get worse, so the two do not line up neatly.
There are research papers from overseas suggesting that SRE contributes to sales.
User-Engagement Score and SLIs/SLOs/SLAs Measurements Correlation of E-Business Projects Through Big Data Analysis
The Covid-19 crisis lockdown caused rapid transformation to remote working/learning modes and the need for e-commerce-, web-education-related projects development, and maintenance. However, an increase in internet traffic has a direct impact on infrastructure and software performance. We study the problem of accurate and quick web-project infrastructure issues/bottleneck/overload identification. The research aims to achieve and ensure the reliability and availability of a commerce/educational web project by providing system observability and Site Reliability Engineering (SRE) methods. In this research, we propose methods for technical condition assessment by applying the correlation of user-engagement score and Service Level Indicators (SLIs)/Service Level Objectives (SLOs)/Service Level Agreements (SLAs) measurements to identify user satisfaction types along with the infrastructure state. Our solution helps to improve content quality and, mainly, detect abnormal system behavior and poor infrastructure conditions. A straightforward interpretation of potential performance bottlenecks and vulnerabilities is achieved with the developed contingency table and correlation matrix for that purpose. We identify big data and system logs and metrics as the central sources that have performance issues during web-project usage. Throughout the analysis of an educational platform dataset, we found the main features of web-project content that have high user-engagement and provide value to services’ customers. According to our study, the usage and correlation of SLOs/SLAs with other critical metrics, such as user satisfaction or engagement improves early indication of potential system issues and avoids having users face them. These findings correspond to the concepts of SRE that focus on maintaining high service availability.
I think it would be quite difficult to perform the same analysis as in this paper.
So, what motivates me to do SRE?
It is that "we can proactively address visibly deteriorating service quality."
Seeing service quality deteriorate right in front of you is, oddly enough, motivating.
 

SRE benefits obtained through WINTICKET


First, let me share what I've found beneficial about working as an SRE at WINTICKET at this stage.

Ease of business integration


If you want to immediately share whether an external failure will affect WINTICKET, SLI/SLO is very useful.
This diagram shows how sharing SLIs/SLOs to convey business impact facilitates collaboration.

It is possible to visualize the potentially deteriorating service quality.


An alert for error budget burn rate was triggered, and upon reviewing the metrics, we found that service quality was gradually deteriorating and the error budget was being consumed rapidly.
@taba2424, one of WINTICKET's top engineers, who is working with me on this SLO rollout
This is not limited to WINTICKET: latent service quality degradation of this kind goes unnoticed unless SLIs/SLOs are in place, and I think being able to spot it is one of the benefits of doing SRE.
 

WINTICKET Service Introduction


WINTICKET was launched in 2019 as an internet betting service for publicly operated sports such as bicycle racing (Keirin) and auto racing. Key features of the service include the ability to bet while watching race videos, and its extensive database of original WINTICKET data, including AI predictions and EX data.
In addition, we offer features that link with ABEMA's Keirin and Auto Race channels. WINTICKET became the No. 1 Keirin betting service in approximately two years since its release and continues to grow.
 

Self-introduction


I'm @rarirureluis from the Service Reliability Group (SRG) of the Media Division.
The reason I'm introducing myself here is because I'm not affiliated with WINTICKET.
For this Advent Calendar, as an SRG, I'd like to introduce my work as an Embedded SRE at WINTICKET, a different team.
 

Embedded SRE


An Embedded SRE is a site reliability engineer who joins a development organization to instill SRE culture and knowledge there, so that developers themselves can practice SRE. (Enabling SRE)
I'm on the SRG team, so I'm an Embedded SRE.
There's also the term "Enabling SRE," which I think is pretty much synonymous. (Incidentally, searching for "Enabling SRE" overseas didn't yield any results.)

Purpose


  • Spread SRE culture and knowledge within the product development team (culture building)
  • Support developers in proactively adopting SRE practices (fostering a culture of SRE adoption)
  • Improve service reliability (SRE)
The ultimate goal of Enabling SRE is to cultivate members within each product team who can autonomously practice SRE.
 

To take on SRE duties in a different department


When I joined as an SRE, instead of immediately talking about SRE, I did various things to gain the trust of the service team.
The service team may get worn out before it can reap the benefits of SRE, so the idea is to soften that somewhat by building up my own credibility beforehand.
Concretely, what I work on when joining a new team is setting up monitoring and alerting, reducing toil, and communicating.

SRE and Monitoring


We start by setting up monitoring and alerting because SRE practice depends on it.
Without a proper monitoring environment, it is impossible to investigate what happened when the error budget is depleted.
That is why Monitoring forms the base of the pyramid diagrams we often see.

Personal monitoring and alerting tools


WINTICKET uses Google Managed Prometheus and Grafana, but personally, I found Datadog easier to use.
Before joining WINTICKET, I used Datadog on a service called DotMoney. Its graphical UI and rich ways of defining SLIs (log-based SLIs, SLIs that combine latency and availability) are things Cloud Monitoring lacks.
 
Although I haven't used them in actual production, SigNoz, which is open source, and Grafana Cloud, which is suitable for small-scale projects or as a SaaS solution, both seemed easy to use.
💡
The article states that it's unusable from an SRE perspective, but recent updates have added support for range vector selectors similar to those in PromQL, so I think it could also be used for SRE purposes.
 

Establishing alerts


Initially, the alerting environment was in place, but there were several issues. The alerting tools were split between Cloud Monitoring and Self-hosted Alertmanager, resulting in the same alerts being defined in both, and the inclusion of unnecessary alerts. In this state, alert maintenance was insufficient, and the effectiveness of fault detection was reduced.
Therefore, we addressed the following two points.
  1. Unification of alert systems
    1. We unified our alerting tools on Grafana, which we were already using as our visualization tool.
  2. Review of alert definitions with the team
    1. We listed all alerts and discussed the necessity and importance of each one with team members.
These efforts have resulted in a simpler and more effective alert structure, improving monitoring operations.

Standardization of alert systems


Since we're already using Grafana, we decided to unify our alert system to Grafana as well.
Cloud Monitoring's alerting system is less flexible than Grafana's.
There were nearly 400 alerts to migrate, so we selected and migrated them one by one.
In the process, we removed alert rules that SLIs/SLOs can cover and reduced the number of alerts that require an on-call response.
Grafana's label-based alert routing (Notification Policies) is intuitive: you simply add a label to only the alerts that should page the on-call engineer.
In addition to the above, we gained four other benefits.
  • Since WINTICKET also uses AWS, monitoring with Grafana has become much easier.
  • I no longer have to manually set up port forwarding for Alertmanager (GKE).
  • Client team alerts can now also be managed with Grafana.
  • The alert notification content has become more flexible.
    • Organizing and outputting information using Grafana Notification Templates
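
To give a rough picture of the label-based routing mentioned above, here is a minimal sketch in Python. It is only a simplified model of the idea, not Grafana's API or its actual policy tree: the `oncall` label, the `pager` and `chat` contact points, and the alert names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Policy:
    """One routing rule: every matcher label must equal the alert's label."""
    matchers: dict[str, str]
    contact_point: str


@dataclass
class Alert:
    name: str
    labels: dict[str, str] = field(default_factory=dict)


def route(alert: Alert, policies: list[Policy], default: str) -> str:
    """Return the contact point of the first policy whose matchers all match."""
    for policy in policies:
        if all(alert.labels.get(k) == v for k, v in policy.matchers.items()):
            return policy.contact_point
    return default


# Hypothetical setup: only alerts labeled oncall=true page someone.
policies = [Policy({"oncall": "true"}, "pager")]
paged = Alert("slo-burn-rate-fast", {"oncall": "true", "team": "server"})
quiet = Alert("disk-usage-warning", {"team": "server"})

print(route(paged, policies, default="chat"))
print(route(quiet, policies, default="chat"))
```

The point is that the decision to page lives in one label on the alert rule, so adding or removing an alert from on-call does not require touching the routing itself.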

We held a Grafana study session.


The purpose of this study session was to share knowledge about Grafana within the team and to enable team members to easily add alerts from Terraform.
That being said, one of my ulterior motives is to increase my visibility within my team as an SRE.
 

Toil reduction


Reducing toil is a quick way to gain trust.
For example:
  • Completely unmaintained IaC
  • A time-consuming deployment flow
Working on toil reduction also lets you learn the service's infrastructure configuration and deployment flow. Finding toil is a bonus, and even if you find none you still come away understanding the configuration, so there is no downside to doing it.
 

Cost reduction


I think cost reduction is a quick way to gain trust.
The advantage of this over toil reduction is that it can be easier in some cases, and cost savings are also welcome news for business teams, making it more effective.
 

Communication


I joined the server team's daily evening meetings and regular meetings, went out for drinks with them, and so on.
These are small things, but they add up.
 

WINTICKET System Configuration


Roughly speaking, this is the system configuration diagram for WINTICKET. (In reality, it's multi-region and quite large-scale.)
 

Monitoring architecture


We standardized our monitoring architecture to Grafana, eliminating Alertmanager and making it simpler.

Before

After

 

SLI/SLO implementation


When it comes time to actually implement SLI/SLO, we'll proceed with engineers from within the team who are interested in SLI/SLO.
While you can proceed alone, having someone who is familiar with the team's situation beside you can make things go more smoothly, so it's best to work with the team if possible.
Not only does it ensure a smooth process, but it also allows team members to accumulate knowledge about SLI/SLO, making it even more effective in the phase of disseminating it throughout the entire team.
This time, I proceeded together with @taba2424.

CUJ


Since the WINTICKET app had already adopted SLI/SLO, we simply applied the same CUJs (Critical User Journeys) on the server side as well.
For more information about CUJ, please see this article.

Identifying SLI/SLO


Once the CUJ is determined, we will identify the SLIs.
We will use a spreadsheet like the one below to summarize and organize the current latency and error rates.
In situations like this, having team members present makes things go much more smoothly.
Made by @taba2424, one of WINTICKET's top engineers

Addition and implementation of SLOs


Once the identification process is complete, we will proceed with actually implementing the SLOs.
We will use Cloud Monitoring's SLO (Service Level Objective) feature, while managing the SLO dashboard with Grafana.
One caveat up front: the Cloud Monitoring dashboard can display only 100 SLOs per service (although more than 100 can actually be registered), so we use Grafana to visualize the SLOs.
We also visualize server metrics using Grafana, so there's no need to switch between tools, which makes things much easier.
 

Keep SLIs high quality


When running a service, you can expect it to be accessed by crawlers, malicious users, and other third parties.
By rejecting such requests in advance, you can measure high-quality SLOs.
At WINTICKET, we utilize Cloud Armor to implement various measures to prevent malicious requests from reaching our microservices.
Specifically, we use rate limiting and pre-configured Google Cloud Armor WAF rules to detect and promptly reject malicious or invalid requests. By blocking inappropriate requests, we maintain system health while further improving the reliability of SLI/SLO.

Metrics used for SLI


The metrics used as SLIs in WINTICKET are as follows:
  • Prometheus metrics output by microservices
  • GCP metrics collected by Cloud Monitoring
 

Availability and Latency SLO


WINTICKET's SLI/SLO uses latency and availability as separate SLOs.
While I personally believe this is the common practice, latency and availability can also be combined and measured as a single SLO.
In fact, DotMoney, which I worked on before WINTICKET as mentioned earlier, uses combined SLOs.
Cloud Monitoring's SLOs do not allow you to combine multiple metrics like Datadog does.

Which is better?


I think the characteristics of the service should also be taken into account, but generally it's easier to keep them separate.
Combined: easier to manage, because there are fewer SLOs.
Separate: easier to investigate, because availability and latency are tracked independently.
Keeping them separate does increase the number of SLOs to manage, but because each SLO is fine-grained, reviewing them becomes quite easy.
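
As a minimal sketch of the difference, the following Python snippet computes separate and combined SLIs from the same requests. The 500 ms threshold, the "status below 500 is good" rule, and the sample requests are assumptions for illustration, not WINTICKET's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


LATENCY_THRESHOLD_MS = 500.0  # hypothetical latency threshold


def is_available(r: Request) -> bool:
    # Assumption: 5xx responses count against availability.
    return r.status < 500


def is_fast(r: Request) -> bool:
    return r.latency_ms <= LATENCY_THRESHOLD_MS


def ratio(requests, is_good) -> float:
    """Good events divided by all events."""
    return sum(1 for r in requests if is_good(r)) / len(requests)


requests = [
    Request(200, 120.0),
    Request(200, 900.0),  # slow but successful
    Request(503, 80.0),   # fast but failed
    Request(200, 200.0),
]

availability_sli = ratio(requests, is_available)
latency_sli = ratio(requests, is_fast)
combined_sli = ratio(requests, lambda r: is_available(r) and is_fast(r))

print(availability_sli, latency_sli, combined_sli)
```

With separate SLIs you can tell at a glance whether the bad events were failures or slow responses. The combined SLI gives one number to manage, but you have to break it down again when investigating.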
 

Target Window


The Target Window is determined by the frequency of regular meetings and the development cycle.
While WINTICKET itself is deployed approximately once a week, the Target Window is set to 30 days, and we hold SLO review meetings for the entire team every two weeks.
Ideally, if service deployments occur every Wednesday, we could schedule regular SLO meetings every Thursday and set the Target Window to one week, allowing us to discuss changes in SLOs and error budgets due to feature releases.
However, with frequent reviews the team is less likely to feel the effect of SLOs, so we intentionally use a longer period so that latent degradation surfaces and the team feels that adopting SLOs was worthwhile.
And considering how things will behave when the error budget runs out, which will be explained in the next section, I think setting the Target Window to one week might be quite difficult.
 

Error budget


The error budget is the acceptable loss of reliability, derived from the SLO.
A diagram showing the daily fluctuations in the error budget.
Servers can temporarily violate an SLO, for example because of high database load.
The error budget expresses how much of such violation can be tolerated.

Error Budget Management


An unconsumed error budget may look commendable, but it can also be read as "Are we deploying too rarely?" or "Are we avoiding technical challenges?"
The error budget is also the budget for taking on technical challenges within the target SLO.
If error budget is left over, tighten the SLO so that the budget ends up at around 0% by the end of the Target Window.

How to calculate the error budget


The error budget is calculated from the SLO target value.
For example, if the SLO is 99.9%, the error budget will be 0.1%. If the target window is 30 days (43,200 minutes), the error budget will correspond to 43.2 minutes of downtime.
Specific calculation example, if the SLO is 99.9%:
  • SLO: 0.999
  • Target Window: 30 days (43,200 minutes)
Error budget = (1 − 0.999) × 43,200 = 0.001 × 43,200 = 43.2 minutes
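
The same calculation as a small Python sketch, in case you want to try other SLO targets. The function name is my own, and it assumes a time-based SLO where the budget is expressed as minutes of downtime.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for a time-based SLO over the target window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo, 30):.2f} minutes")
```

Each additional nine cuts the budget to a tenth, which is worth keeping in mind when deciding how far to tighten an SLO.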
 

Error budget burn rate


Burn rate is a term coined by Google: a unitless value indicating how quickly the error budget is consumed relative to the target window of an SLO (Service Level Objective). For example, if the target window is 30 days, a burn rate of 1 means that, at a constant rate, the error budget will be completely consumed in exactly 30 days. A burn rate of 2 means that, at a constant rate, the error budget will be depleted in 15 days, and a burn rate of 3 means that it will be depleted in 10 days.
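
A minimal sketch of this relationship in Python. The function names are my own; the burn rate is taken as the observed error rate divided by the error rate the SLO allows, and the rate is assumed to stay constant.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days until the whole error budget is gone if the burn rate stays constant."""
    return window_days / rate


for rate in (1, 2, 3):
    print(f"burn rate {rate}: depleted in {days_until_exhausted(rate):.0f} days")

# A 0.2% error rate against a 99.9% SLO burns the budget about twice as fast as allowed.
print(round(burn_rate(0.002, 0.999), 3))
```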
The Datadog documentation provides a clear explanation of burn rate.
WINTICKET sets up alerts for burn rate with 5-minute and 1-hour time windows for each SLO.
From now on, we will refer to these as the fast burn rate and the slow burn rate, respectively.

Fast burn rate


Its purpose is to detect cases where a release introduced a bug or where changed code degraded service quality.
Although it is not yet part of WINTICKET's workflow, we want to make it usable as a criterion for deciding whether to roll back after a canary release.

Slow burn rate


Unlike the fast burn rate, which catches sudden failures, this is an important warning signal indicating a chronic deterioration in the system's service quality.
While immediate action isn't necessary, this is an indicator that should be continuously monitored.
 

Latency uses the slow burn rate only; availability depends on the nature of the application


For burn rate alerts related to availability SLOs, we have configured alerts for both fast and slow as mentioned above, but for latency, we have configured only slow alerts.
Availability alert
Latency alert
One reason for not applying the fast burn rate to latency is that it can easily become alert noise.
If the endpoint being measured depends on an external API, its latency is affected by the latency of that external API, so short-lived spikes are common.
The slow burn rate alone is enough to catch latency degradation.
For availability, on the other hand, we configure both fast and slow alerts to take advantage of the characteristics of each.

Burn rate alert for late night


Simple burn rate alerts, with their short time window, often occur during off-peak hours such as late at night when requests are low.
For example, if a system receives 10 requests per hour, one failed request means a 10% error rate for that hour. With a 99.9% SLO, this single request amounts to a 100x burn rate (10% / 0.1%) and, over that hour, consumes 13.9% of the 30-day error budget, triggering an alert immediately.
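
The arithmetic for this low-traffic case can be checked with a short Python sketch. The function names are my own, and the budget calculation assumes traffic is spread evenly over the 30-day window, which is exactly the assumption that breaks down late at night.

```python
def burn_rate_from_counts(failed: int, total: int, slo: float) -> float:
    """Burn rate for one measurement window, from raw request counts."""
    return (failed / total) / (1 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed by burning at `rate` for `hours`.

    Assumes traffic is uniform across the whole window.
    """
    return rate * hours / (window_days * 24)


rate = burn_rate_from_counts(failed=1, total=10, slo=0.999)
consumed = budget_consumed(rate, hours=1)

print(f"burn rate: {rate:.0f}x")
print(f"budget consumed in 1 hour: {consumed:.1%}")
```

Here 10% divided by the allowed 0.1% gives a burn rate of about 100, and 100 × 1 hour out of 720 hours gives roughly 13.9% of the budget. In reality the hour with 10 requests carries far fewer events than an average hour, so the true impact is much smaller than the alert suggests.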
Made by anies1212 of the WINTICKET App team
WINTICKET mitigates this somewhat by setting the burn rate threshold to 3 or 20.
As another approach, sre.google describes a method of artificially generating normal requests.
This dilutes the impact of failed requests, even during periods with few requests.
It seems a bit unrealistic, but I think it's interesting.
There are many other approaches introduced, so please take a look.

If you want a more precise burn rate


By combining multiple windows and multiple burn rates, false positives can be reduced.
In this example, the alert fires only when the burn rate reaches 14.4 in both the 5-minute and 1-hour windows (a burn rate of 14.4 sustained for 1 hour consumes 2% of a 30-day error budget).
While this gives up some immediacy, it lets you create meaningful alerts that indicate a real deterioration in service quality, excluding noisy issues that recover quickly.
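
The multi-window condition can be sketched in a few lines of Python. This is only an illustration of the logic, with hypothetical function names; the real alert is defined in the monitoring tool, not in application code.

```python
def should_alert(short_window_rate: float, long_window_rate: float,
                 threshold: float = 14.4) -> bool:
    """Fire only when both the short and the long window exceed the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold


def budget_fraction(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the error budget consumed at a constant burn rate."""
    return rate * hours / (window_days * 24)


# A short spike that has already recovered on the 1-hour view: no alert.
print(should_alert(short_window_rate=50.0, long_window_rate=2.0))
# Sustained burn that is still ongoing: alert.
print(should_alert(short_window_rate=20.0, long_window_rate=15.0))
# The problem has stopped (5-minute window is calm): no alert.
print(should_alert(short_window_rate=3.0, long_window_rate=15.0))

# Where 14.4 comes from: 14.4 * 1h / 720h = 2% of the 30-day budget.
print(f"{budget_fraction(14.4, hours=1):.1%}")
```

The long window confirms that the burn is significant, and the short window confirms that it is still happening, so the alert also resets quickly once the issue is resolved.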
 

When the error budget is depleted


At DotMoney, where I previously participated as an Embedded SRE, I talked with the business manager and we were able to agree that "once the error budget is exhausted, feature releases are prohibited, except for responding to production failures, making improvements to restore reliability, and releasing features that involve external companies."
At WINTICKET, the server team has developed its own culture: when the error budget runs out, we determine whether the cause is external, and if it is not, we create a task and assign members to restore the error budget.
 

SLI/SLO adoption


At this point, the SLIs/SLOs have been defined as code and can be visualized.
We are creating a Grafana dashboard for each component.
 

Content to be visualized


What is the purpose of visualization?
The following items are necessary when reviewing SLOs:
  • Error budget
  • Current SLO
SLO summary for each component
However, this alone is not enough.
When an error budget is depleted, more detailed information is needed to dig into the problem:
  • Error budget time series graph
  • SLI time series graph
  • Latency time series graph for the relevant SLI (latency SLOs)
  • Response code time series graph for the relevant SLI (availability SLOs)
 
Details of SLOs for each component
By placing all of this on the same dashboard, you get the following benefits when reviewing:
  • You can determine when things got worse or better.
    • If the timing coincides with a release, the release is the likely cause.
    • It may also coincide with an external service outage.
  • You can determine whether things are continuously getting worse.
  • After taking measures against a worsened SLO, you can see whether it is trending toward improvement.
  • If nothing is actually wrong and the configured SLI/SLO is simply too strict, you can adjust it.

Implementation of a study session


We held a study session for the entire team.
The goal is to help people understand SLI/SLO even a little, but it's impossible to understand it all in just one study session.
I didn't understand it either, so I shared things like "This is what we're going to do!" and "These are the benefits!" and ran it in a kind of pep-rally style.
Made by @taba2424

We review the SLOs (Service Level Objectives) as a team every other week.


The purpose of reviewing the SLOs with the entire team is as follows:
  • By working together as a team, we can cultivate a culture of SLI/SLO.
  • (When service quality actually deteriorates) team members can feel the benefits of SLI/SLO firsthand, which keeps their motivation up.

How to do reviews on WINTICKET


WINTICKET has 103 SLOs (Service Level Objectives) at the time of writing.
Reviewing all of them would be quite exhausting.
Therefore, the review covers only the SLOs whose error budget has been depleted, plus a follow-up on the SLOs whose error budget was depleted at the previous review.
This keeps the operation simple and low-burden from the start.
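
The selection rule for the review agenda is simple enough to sketch in Python. The SLO names and numbers below are hypothetical, and in practice we just read these off the Grafana dashboard rather than running code.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    name: str
    budget_remaining: float  # fraction of error budget left; <= 0 means depleted


def slos_to_review(current, depleted_last_time):
    """Pick SLOs that are depleted now or were depleted at the previous review."""
    return [
        slo.name
        for slo in current
        if slo.budget_remaining <= 0 or slo.name in depleted_last_time
    ]


# Hypothetical snapshot of three SLOs.
current = [
    SloStatus("ticket-purchase-availability", -0.12),
    SloStatus("race-list-latency", 0.35),
    SloStatus("login-availability", 0.80),
]
depleted_last_time = {"race-list-latency"}

agenda = slos_to_review(current, depleted_last_time)
print(agenda)
```

Out of the three SLOs, only the one that is depleted now and the one being followed up from last time end up on the agenda, which is what keeps a review of 100+ SLOs manageable.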
💡
Ultimately, the goal is to agree with the business side on how to handle error budget depletion, so that when an error budget is depleted, someone on the server team works on restoring it.
💡
What should be done about SLOs whose error budget is not being consumed at all? At WINTICKET, every three or six months we identify SLIs/SLOs whose remaining error budget stays at close to 100% and tighten them, to avoid a situation where the error budget is excessive.
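
As a rough sketch of how such candidates could be picked out, the following Python snippet computes the remaining error budget from request counts and lists the SLOs that have barely touched it. The 95% cutoff, the SLO names, and the counts are assumptions for illustration.

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unused (1.0 = untouched, 0 = depleted)."""
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def tightening_candidates(stats, min_remaining: float = 0.95):
    """Names of SLOs whose remaining budget is at or above `min_remaining`."""
    return [
        name
        for name, (good, total, slo) in stats.items()
        if budget_remaining(good, total, slo) >= min_remaining
    ]


# Hypothetical counts over the target window: (good events, total events, SLO target).
stats = {
    "login-availability": (999_990, 1_000_000, 0.999),  # about 99% of budget left
    "race-list-latency": (999_500, 1_000_000, 0.999),   # about 50% of budget left
}

candidates = tightening_candidates(stats)
print(candidates)
```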
 
Since WINTICKET uses Wrike for task management, we also use Wrike for reviewing our SLOs (Service Level Objectives).
We strive to minimize the number of tools we use and to operate in a way that doesn't overwhelm our team members.
SLO task management in Wrike
The actual review process is as follows:
  1. Choose a facilitator at random.
  2. Split into teams, one per category.
  3. Each team checks the SLOs for its assigned category in Grafana.
  4. If an error budget is depleted, create a Wrike task.
  5. Decide whether action is needed.
    1. If the cause is an external API failure, no action is required.
  6. Check the status of the SLO tasks (error budget depleted) that were already created in Wrike.
    1. Leave a comment and update the error budget field.
 

In reality, service quality is deteriorating day by day.


As mentioned earlier, the quality of service is deteriorating day by day.
For example, an increasing number of database records can lead to increased latency.
Or perhaps a table was added as part of a new initiative, but there are issues with the index, etc.
Even if there are no problems at first, the impact becomes obvious as the number of records increases.
An example of quality getting worse day by day

Why have we been able to implement SLI/SLO to this extent?


Looking at the current state of WINTICKET adoption, it seems that members other than those who initially promoted SLI/SLO are now starting to investigate the causes of deterioration through burn rate alerts and the SLO dashboard, and the server team is taking the initiative to manage SLI/SLO on their own.
In fact, I was the facilitator for the first review meeting, but now someone from the server team is doing it.
That we've come this far is thanks in large part to @taba2424, who worked on this with me and whose exceptional abilities were very reassuring.
In addition, I think a major factor was that @akihisasen, WINTICKET's development manager, was interested in SLI/SLO from the beginning, knew its benefits, and created a structure that made it easy to move forward.
 

In conclusion: The future and ideals of SLI/SLO


Currently, SLI/SLO remains a common language within the server team only, and has not yet achieved its original goal of becoming a common language with the business side.
The SRE team's next task is to integrate it into business roles. To achieve this, we plan to take the following approach:
  • Improve comprehensiveness
    • Add new SLOs
    • Improve SLO reliability (e.g., adjust SLIs)
  • Have someone from the business side participate in the server team's SLO review meeting
 
We have also created a business-oriented dashboard that displays the current SLOs (Service Level Objectives) in a way that is easy for business people to understand.
We will utilize these methods to gradually increase their adoption.
 

The need for business integration and a culture where those who improve SLOs are celebrated.


I believe that those who improve service quality should be praised and appreciated not only by engineers, but by everyone involved with the product.
However, WINTICKET currently does not have an agreement with the business side.
First, we are working to implement and integrate SLOs throughout the entire server team, while also improving their reliability and comprehensiveness.
 

Aside: Embedded SRE? Enabling SRE?


I've been hearing the term "Enabling SRE" a lot lately, and when I looked into it, I couldn't find it anywhere overseas.
When I asked about it in the X community "SRE, observability, etc.", I received a lot of information.
The book "Team Topologies" apparently contains a description of "Enabling."
 
SRG is looking for new team members. If you are interested, please contact us here.