I tried SRE for a service as an Embedded SRE
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
This article briefly introduces how SRG, a horizontal organization, builds and maintains an SRE culture within specific teams. I will also share the materials we used to explain SRE to the business side and the OSS we created to support the culture, and I hope this helps you introduce SRE yourself.
- Goals of this article
- ⚠ Before you deploy SRE
- Finding Toil
- Communication
- Establishing an incident response flow
- 1. Discuss quality of service with the business side
- Materials used to explain SRE to the service side
- 2. Determine the Critical User Journey (CUJ)
- 3. Determine SLI from CUJ
- What metrics do the SLIs refer to?
- 4. Writing SLO queries
- 5. Target SLO/Window and Error Budget
- About the Target Window
- 6. When the error budget runs out
- 7. Instilling SRE in the workplace, holding regular meetings, etc.
- Instilling SRE throughout the service
- Tips
- How do I alert on SLOs?
- Error Budget Burn Rate
- Conclusion
Goals of this article
This article walks through how SRG, a horizontal organization, joined the DotMoney service as an Embedded SRE (Site Reliability Engineer), built an SRE culture, and improved service quality.
This is just my personal approach and it may not apply to every situation, but I hope you will come to see it as one workable option.
⚠ Before you deploy SRE
When I joined as an Embedded SRE, rather than just starting to talk about SRE, I did various things to gain the trust of the service department.
This was my first time working as an SRE, and the service side can run out of steam before they ever see the benefits of SRE. The aim was to mitigate that somewhat by first building up my own credibility.
The first things I did were to reduce toil and to communicate.
Finding Toil
We discovered toil while investigating the service's infrastructure configuration and deployment flow.
For example:
- Completely unmaintained EC2 instances and IaC
- Wasted cost due to excessive resource allocation
- A long-running deployment flow
- No defined response flow for incidents
and so on.
Since this article is about SRE, I won't go into detail here, but by migrating from EC2 to Fargate we were able to show the service side clear, numerical results for the unmaintained EC2/Ansible setup, the cost wasted on excessive resource allocation, and the improved deployment flow.
Communication
DotMoney holds a daily server-side evening meeting and a weekly general meeting, so I attend those, actively review IaC PRs, and (of course) respond to incidents.
Establishing an incident response flow
One of SRE's responsibilities in maintaining service quality is to establish an incident response flow and shorten MTTR (Mean Time To Repair).
And that culture needs to continue.
The existing incident response documentation was outdated and the flow no longer worked in practice, so we rebuilt it from scratch, including the business-side flow.
As part of this, we introduced on-call, established the incident response flow, and used Datadog Incident Management to measure MTTR and run postmortems.
1. Discuss quality of service with the business side
Maintaining service quality is impossible without cooperation from the business side.
In particular, without the product manager's approval it would be difficult to take any action once the error budget is depleted.
Materials used to explain SRE to the service side
These are the materials used to explain SRE at a general meeting that included engineers and business representatives on the service side.
I don't expect everyone to understand everything from these materials; few people can grasp SRE in the several minutes it takes to read through them.
The goal of this meeting was to convey the concept of SRE (introducing SRE makes it possible to visualize service quality and provide information for business decisions).
2. Determine the Critical User Journey (CUJ)
After giving a general overview of SRE, we determined the CUJ starting from the area with the greatest business impact.
Furthermore, instead of defining many CUJs right away, we decided on one CUJ first.
Since DotMoney is a point exchange site, the flow with the greatest business impact is:
The user visits the site → opens the product list page → can complete an exchange
We defined this series of steps as our CUJ.
3. Determine SLI from CUJ
After determining the CUJ, we look for the SLI that serves as an indicator of the SLO.
This is also quite a difficult point.
What metrics do the SLIs refer to?
DotMoney does not implement Real User Monitoring (RUM), so instead of front-end metrics it uses the logs of the load balancer closest to the user.
If you use front-end metrics, they are affected by each user's network environment, filtering out the outliers takes real work, and RUM always comes with a cost. So if you are introducing SRE for the first time, I think the easiest approach is to use the metrics and logs of the load balancer closest to the user.
Ideally, if accurate data can be collected on the front end, it is more faithful than load balancer metrics, because it can capture changes in service quality caused by the front-end implementation itself.
In Learn how to set SLOs, an article by Google SRE Cindy Quach, she likewise gives an example of measuring at the load balancer (Istio) rather than at the front end.
And here's a Google Cloud Architecture Center article that goes into more detail about where to measure SLI from.
That article also states that client-side measurement involves many volatile factors and is therefore not well suited as a trigger for emergency response.
Of course, it is possible to measure on the front end (client) if you put in the effort, but since this is my first time working as an SRE, I decided to measure on the load balancer closest to the user, which is the most convenient option.
4. Writing SLO queries
The CUJ this time (the user visits the site → opens the product list page → can complete an exchange) defines the same SLI for every step: "a normal response is returned within 3 seconds."
Why 3 seconds?
We investigated response times before deciding on the SLO, and found that 3 seconds seemed like a good value that could be achieved at the current time.
For another of our services, we actually added intentional delays to API responses and chose the threshold based on our own experience of the point at which the user experience starts to degrade.
Let's take as an example the Datadog query currently used by DotMoney when a user visits the homepage.
DotMoney is frequently subject to DoS attacks, so we filter out DoS-related requests and other suspicious requests.
Suspicious user agents change from day to day, so even when there is no real impact on service quality the SLO value keeps getting worse; this is why the SLO query needs to be reviewed periodically.
@http.status_code:([200 TO 299] OR [300 TO 399])
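For reference, here is a rough sketch of what a full "good events" query for this step could look like once the 3-second latency condition and suspicious-request filtering are added. The facet names (@duration, @http.useragent, @http.url_details.path), the path, and the excluded user agent are placeholders rather than DotMoney's actual query, and this assumes the duration attribute is recorded in nanoseconds (3 seconds = 3000000000):
@http.url_details.path:"/" @http.status_code:([200 TO 299] OR [300 TO 399]) @duration:<3000000000 -@http.useragent:suspicious-bot*
The SLO is then the ratio of these good events to total events, where the denominator applies the same path and user-agent filters but not the status and latency conditions.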
5. Target SLO/Window and Error Budget
When we actually calculated the current SLO using the query in the previous section, we found that it was around 99.5%.
Therefore, I set all the Target SLOs to 99.5%.
Target SLO can be flexibly lowered or raised.
It is important to set an appropriate value at first and then review it regularly.
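As a rough illustration of what that target means (the request volume here is a made-up number, not DotMoney's actual traffic):
error budget = 100% - 99.5% = 0.5% of events in the 30-day window
at a constant failure rate, 30 days × 0.5% ≈ 3.6 hours of fully failed time
with, say, 10 million requests in the window, about 50,000 bad requests exhaust the budget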
Note that in this case, not spending your error budget is not a good thing.

As you can see in the image above, the error budget of the SLO monitor for the "exchange complete" step is not being consumed at all.
At first glance, not consuming the error budget might look praiseworthy, but it can also be read as "are deployments too infrequent?" or "are we not taking on technical challenges?"
The error budget is the budget allocated for technical challenges within the bounds of the Target SLO.
If you consistently have error budget left over, try tightening the target so that the remaining budget lands at exactly 0% by the end of the target window, or increase the number of deployments and take on more technical challenges.
About the Target Window
The Target Window is determined by how often you hold your regular reviews and by your development cycle.
The deployment cycle for DotMoney itself is approximately once a week, but since I am the only Embedded SRE working on DotMoney, I decided that 30 days would be a good Target Window given our resources.
For example, if your service is deployed every Wednesday, you could schedule SLO reviews on Thursdays with a one-week Target Window, which lets you discuss changes to SLOs and error budgets caused by feature releases.
Considering how things will behave when the error budget runs out, which I will explain in the next section, I think setting the Target Window to one week may be quite strict.
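As a rough illustration of that strictness (reusing the 99.5% target from earlier):
30-day window: 30 days × 0.5% ≈ 3.6 hours of fully failed time
7-day window: 7 days × 0.5% ≈ 50 minutes of fully failed time
A single bad release can wipe out a one-week budget, so the release freeze described in the next section would kick in almost immediately.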
6. When the error budget runs out
With the consent of the business side, we decided how DotMoney behaves when the error budget is exhausted:
"When the error budget is exhausted, feature releases are prohibited, except for production outages, fixes that restore reliability, and releases involving external companies."
That is the rule we settled on.

DotMoney allows the external-company exception because there are times when a release absolutely has to go out due to relationships with outside partners.
We also make an exception to the feature-release freeze when we can secure the resources to work on reliability fixes in parallel.
This will slow down feature releases, but it doesn't prevent them.
In fact, when the error budget ran out we were just about to release a new feature, but we were able to secure one DotMoney engineer for reliability work, and the two of us (including myself) worked on fixes to restore reliability while the new feature was released in parallel. I think we handled that period of exhausted error budget quite well.
7. Instilling SRE in the workplace, holding regular meetings, etc.
SRE is a never-ending culture.
Once SRE is introduced, it is necessary to hold regular reviews and to ensure that SRE is incorporated into the service.
We regularly check not only the error budget, but also review what we have done so far to see if our SLIs are correct, and whether our SLOs are too lenient or too strict.
The regular meetings I'm referring to are not something I run alone as a member of SRG; they are held together with the Embedded SRE, DotMoney's engineers, and people from the business side.
The goal is to spread SRE by involving people at DotMoney.
Instilling SRE throughout the service
When we thought about how to instill SRE throughout the entire service, we realized that regular meetings alone would not reach people such as those in charge of CS, the front end, and the business side.
So we created a tool that posts an SLO summary once a week to a random channel that everyone on the service side joins.
datadog-slo-insufflate
It is available as a container image, so it can be used easily.


Although the number of reactions is still small, more people are responding than at the beginning.
It's a tool that's better to have than not have.
On the subject of outages: the SLO value had been getting worse day by day.
Then one day, a large-scale outage occurred.
This was before we introduced the error budget burn rate alerts described in the Tips below, so I did not notice it in advance, but I learned that if the SLO value keeps getting worse, it will eventually blow up.
Tips
How do I alert on SLOs?
We have not set any alerts on the SLOs this time, and we have no plans to do so in the future.
This is because we have set an alert for the error budget burn rate, which we will explain later.
Error Budget Burn Rate
Burn rate is a term coined by Google: a unitless value that indicates how quickly the error budget is consumed relative to the SLO's target window. For example, with a 30-day target, a constant burn rate of 1 consumes the entire error budget in exactly 30 days, a burn rate of 2 consumes it in 15 days, and a burn rate of 3 consumes it in 10 days.
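Expressed as a simple formula (this is the standard definition, not something specific to DotMoney):
burn rate = (observed rate of bad events) / (1 - Target SLO)
time to exhaust the error budget = Target Window / burn rate
e.g. a 30-day Target Window and a burn rate of 14.4 → 30 / 14.4 ≈ 2.1 days (about 50 hours)
The 14.4 threshold used in the alert below corresponds to consuming roughly 2% of a 30-day error budget in a single hour, the classic "fast burn" condition from the Google SRE Workbook.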
Datadog's documentation on burn rate is very clear.
This error budget burn rate allows you to alert directly on your error budget, eliminating the need to alert on SLOs.
It also kills two birds with one stone by eliminating 5xx error-rate alerts, which tend to be noisy.
The alerts actually set up on DotMoney look like this.
burn_rate("").over("30d").long_window("1h").short_window("5m") > 14.4
Since it was not possible to set short_window and long_window at the same time from the web UI, I set this up via Terraform (the API). It may be possible from the UI by the time this article is published.
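As a rough sketch, such a monitor can be defined with the Datadog Terraform provider along the following lines (the resource name, SLO ID, and notification target are placeholders; check the provider documentation for the exact schema):
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

# Burn-rate alert on the SLO: fires only when both the long and the short
# window exceed a burn rate of 14.4.
resource "datadog_monitor" "slo_burn_rate" {
  name = "[SLO] Error budget burn rate is too high"
  type = "slo alert"

  # Same shape as the query shown above; "<your-slo-id>" is a placeholder.
  query = "burn_rate(\"<your-slo-id>\").over(\"30d\").long_window(\"1h\").short_window(\"5m\") > 14.4"

  # Placeholder notification target.
  message = "The error budget is burning too fast. @slack-your-channel"

  monitor_thresholds {
    critical = 14.4
  }
}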
Why set both windows at the same time?
Using only the short_window increases alert frequency and noise, so adding the long_window as a condition reduces the noise and makes the alert more trustworthy.
Conclusion
This article is essentially an account of my experience as an SRE beginner and how I introduced SRE to a service.
Introducing SRE is a high hurdle, but I think it's important to just give it a try.
Of course, SRE is both a culture and an organization, so it's not a case of just introducing it and then it's over; the story is likely to continue.
I don't fully understand SRE myself, but I think the quickest way to gain knowledge about SRE is to introduce it without thinking too deeply about it at first and improve it day by day. (It's hard to see the future and there's a lot of mundane work...)
What I am thinking about right now is visualizing the link between SRE and business impact.
For example, I want to work out how to visualize the difference in revenue between periods when the error budget is exhausted and goes negative and periods when the budget is fully used (or has a surplus), and turn SRE's impact on the business from something vague into something concrete.
If anyone has already done this please let me know.
SRG is looking for people to work with us. If you are interested, please contact us here.