I tried SRE for a service as an Embedded SRE

This is Hasegawa (@rarirureluis).
SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
This article briefly introduces how SRG, a horizontal organization, creates and maintains an SRE culture within specific service teams. I will also share the materials we used to explain SRE to the business side and the OSS we built to help establish the culture, and I hope this helps you introduce SRE yourself.
 

Goals of this article


This article walks through how SRG, a horizontal organization, joined the DotMoney service as an Embedded SRE (Site Reliability Engineer), built an SRE culture, and improved service quality.
This is just my personal approach and may not apply to every situation, but I hope you will find it a viable option.
 

⚠ Before you introduce SRE


When I joined as an Embedded SRE, rather than talking about SRE right away, I first did various things to earn the trust of the service team.
This was my first time working as an SRE, and there was a real risk that the service side would tire out before reaping the benefits of SRE. The goal was to mitigate that somewhat by building up my own credibility first.
The first things I did were reducing toil and communicating.

Finding toil

We discovered toil while investigating the service's infrastructure configuration and deployment flow.
For example:
  • Completely unmaintained EC2 instances and IaC
  • Wasted cost due to excessive resource allocation
  • A long-running deployment flow
  • No response flow for when an incident occurs
etc.
Since the focus of this article is SRE, I will not go into detail, but by migrating from EC2 to Fargate we were able to show the service side clear, measurable results: retiring the unmaintained EC2 instances and Ansible, cutting the cost waste from excessive resource allocation, and improving the deployment flow.

Communication

At DotMoney there are daily server-side evening meetings and a general meeting once a week, so I attend those, actively review PRs for the IaC, and (of course) respond to incidents.

Establishing an incident response flow

In order to ensure service quality, one of the responsibilities of SRE is to establish an incident response flow and shorten MTTR (Mean Time To Repair).
And that culture needs to continue.
The incident response flow documentation was outdated and the flow was not actually being followed, so we rebuilt it from scratch, including the business-side flow.
Specifically, we introduced on-call, established the incident response flow, and used Datadog Incident Management to measure MTTR and run postmortems.
 

1. Discuss quality of service with the business side


Maintaining service quality is impossible without cooperation from the business side.
In particular, it would be difficult to take any action after the error budget has been depleted without the permission of the product manager.

Materials used to explain SRE to the service side

 
These are the materials used to explain SRE at a general meeting that included engineers and business representatives on the service side.
I don't expect everyone to grasp everything from these materials; few people can understand SRE from just a few minutes of reading.
The goal of this meeting was to convey the concept of SRE (introducing SRE makes it possible to visualize service quality and provide information for business decisions).
 

2. Determine the Critical User Journey (CUJ)


After giving a general overview of SRE, we determine the CUJ starting from the area with the greatest business impact.
Also, rather than defining many CUJs right away, we decided on a single CUJ first.
Since DotMoney is a points exchange site, the flow with the biggest business impact is:
The user visits the site → opens the product list page → can complete an exchange
We defined this series of steps as the CUJ.
 

3. Determine SLI from CUJ


After determining the CUJ, we look for SLIs that can serve as indicators for the SLO.
This is also quite a difficult point.

What metrics do the SLIs refer to?

DotMoney does not have Real User Monitoring (RUM) in place, so instead of front-end metrics we use the logs of the load balancer closest to the user.
Front-end metrics are affected by each user's network environment, filtering out the outliers is a lot of work, and RUM always comes with a cost, so if you are introducing SRE for the first time, I think the easiest path is to use the metrics and logs of the load balancer closest to the user.
Ideally, if accurate data can be collected on the front end, it is more faithful than load balancer metrics, because it can also track changes in service quality caused by front-end implementations.
In an article published by Cindy Quach, a Google SRE, titled Learn how to set SLOs, she gives an example of using LB (Istio) without measuring at the front end.
 
And here's a Google Cloud Architecture Center article that goes into more detail about where to measure SLI from.
Implementing SLO | Cloud Architecture Center | Google Cloud
That article also notes that client-side measurements involve many volatile factors and are therefore not well suited as triggers for incident response.
Of course, it is possible to measure on the front end (client) if you put in the effort, but since this is my first time working as an SRE, I decided to measure on the load balancer closest to the user, which is the most convenient option.
 

4. Writing SLO queries


For this CUJ (the user visits the site → opens the product list page → can complete an exchange), every step is defined as "a normal response is returned within 3 seconds."
💡
Why 3 seconds? We investigated response times before deciding on the SLO and found that 3 seconds was a value we could realistically achieve today. In another of our services, we actually added intentional delays to API responses and, based on that first-hand experience, chose a threshold just before the user experience starts to suffer.
 
Let's take as an example the Datadog query currently used by DotMoney when a user visits the homepage.
 
DotMoney is frequently subject to DoS attacks, so we filter out DoS-related requests and other suspicious requests.
Suspicious user agents change from day to day, so the SLO value keeps getting worse even when there is no actual impact on service quality. For this reason, the SLO and its query need to be reviewed periodically.
@http.status_code:([200 TO 299] OR [300 TO 399])
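As a reference, here is a minimal sketch of how an SLO built on top of a query like this could be declared with the Datadog Terraform provider. The resource name, metric names, and description are hypothetical (they assume log-based metrics generated from the load balancer logs); this is not DotMoney's actual configuration.

resource "datadog_service_level_objective" "home_latency" {
  name        = "Home page: normal response within 3 seconds"
  type        = "metric"
  description = "Good = 2xx/3xx responses returned within 3s; valid = requests left after filtering out DoS and suspicious user agents"

  query {
    # Hypothetical log-based metric counting good requests (2xx/3xx within 3 seconds)
    numerator   = "sum:slo.home.good{*}.as_count()"
    # Hypothetical log-based metric counting all valid requests after filtering
    denominator = "sum:slo.home.valid{*}.as_count()"
  }

  thresholds {
    timeframe = "30d"
    target    = 99.5
  }
}

Keeping the good/valid definitions in log-based metrics makes it easy to adjust the user-agent filters later without touching the SLO resource itself.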
 

5. Target SLO/Window and Error Budget


When we actually calculated the current SLO using the query in the previous section, we found that it was around 99.5%.
Therefore, I set all the Target SLOs to 99.5%.
💡
Target SLO can be flexibly lowered or raised. It is important to set an appropriate value at first and then review it regularly.
 
Note that in this case, not spending your error budget is not a good thing.
SLO monitor used by DotMoney
As you can see in the image above, the error budget for the "Exchange Complete" SLO monitor is not being consumed at all.
At first glance, not consuming the error budget might look praiseworthy, but it can also be read as "are there too few deployments?" or "are we not taking on technical challenges?"
The error budget is the budget allocated for technical challenges on top of the Target SLO.
If you have error budget left over, try tightening the target so that the budget lands at exactly 0% at the end of the target window, or increase the number of deployments and take on more technical challenges.
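For intuition, here is the rough arithmetic behind a 99.5% target over a 30-day window, treating the budget as pure downtime for simplicity:

error budget = 1 − 0.995 = 0.5% of valid requests in the window
as time: 0.5% × 30 days = 0.005 × 43,200 minutes = 216 minutes ≈ 3.6 hours

In practice the budget is consumed per failed or slow request rather than as contiguous downtime, but the time view makes the budget easier to discuss with the business side.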

About the Target Window

The Target Window is determined by the frequency of your regular meetings and your development cycle.
DotMoney's deployment cycle is roughly once a week, but since I am the only Embedded SRE working on DotMoney, I decided that a 30-day Target Window was a good fit for our resources.
For example, if your service is deployed every Wednesday, you could hold an SLO review every Thursday with a one-week Target Window, which lets you discuss changes to the SLO and error budget caused by each feature release.
Considering how things will behave when the error budget runs out, which I will explain in the next section, I think setting the Target Window to one week may be quite strict.
 

6. When the error budget runs out


With the consent of the business side, we decided how DotMoney behaves when the error budget is exhausted:
"While the error budget is exhausted, feature releases are prohibited, except for production outages, fixes to restore reliability, and releases involving external companies."
DotMoney makes an exception for external companies because there are times when a release is unavoidable due to those relationships.
We also allow a feature release as an exception if we can secure the resources to work on reliability fixes in parallel.
This slows feature releases down, but it does not block them entirely.
In fact, when the error budget ran out we were close to releasing a new feature, but we were able to secure one DotMoney engineer for reliability work, and the two of us, myself included, worked on reliability fixes while the new feature was released in parallel. I think we handled that exhaustion of the error budget quite well.
 

7. Instilling SRE in the workplace, holding regular meetings, etc.


SRE is a never-ending culture.
Once SRE is introduced, it is necessary to hold regular reviews and to ensure that SRE is incorporated into the service.
We regularly check not only the error budget, but also review what we have done so far to see if our SLIs are correct, and whether our SLOs are too lenient or too strict.
Rather than running them alone within SRG, the organization I belong to, these regular meetings are held by the Embedded SRE together with DotMoney's engineers and people from the business side.
The goal is to spread SRE by involving people at DotMoney.

Instilling SRE in the service


When we thought about how to instill SRE across the entire service, we realized that regular meetings alone would not be enough to reach those in charge of CS, the front end, and the business side.
So we created a tool that posts an SLO summary once a week to the random channel that everyone on the service side has joined.
 
datadog-slo-insufflate
It is distributed as a container image, so it is easy to adopt.
Reactions are still few, but more people respond now than at the beginning.
It's a tool that is better to have than not.
💡
Speaking of outages: the SLO value had been getting worse day by day, and then one day a large-scale outage occurred. This was before we introduced the error budget burn rate described in the Tips below, so I did not notice it in advance, but I learned that if the SLO value keeps deteriorating, it will eventually blow up.
 

Tips


How do I alert on SLOs?

We have not set any alerts on the SLOs this time, and we have no plans to do so in the future.
This is because we have set an alert for the error budget burn rate, which we will explain later.

Error Budget Burn Rate

Burn rate is a term coined by Google: a unitless value that indicates how fast the error budget is being consumed relative to the length of the SLO's target window. For example, with a 30-day window, a constant burn rate of 1 consumes the entire error budget in exactly 30 days, a constant burn rate of 2 in 15 days, and a burn rate of 3 in 10 days.
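In other words, assuming a constant burn rate:

time to exhaust the error budget = target window ÷ burn rate
30 days ÷ 1 = 30 days, 30 days ÷ 2 = 15 days, 30 days ÷ 3 = 10 days

The 14.4 threshold used in the alert below therefore corresponds to exhausting a 30-day budget in roughly 2 days, or burning about 2% of the budget (14.4 ÷ 720 hours) in a single hour.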
Datadog's documentation on burn rate is very clear.
 
This error budget burn rate allows you to alert directly on your error budget, eliminating the need to alert on SLOs.
It also kills two birds with one stone by eliminating the 5xx error-rate alerts, which tend to be noisy.
The alerts actually set up on DotMoney look like this.

burn_rate("").over("30d").long_window("1h").short_window("5m") > 14.4
message

💡
Since it is not possible to set short_window and long_window at the same time via the web UI, I set this monitor via Terraform (the API). This may be supported in the UI by the time this article is published.
💡
Why set both windows? Using only the short_window increases the frequency of alerts and makes them noisy, so adding the long_window as a condition reduces the noise and makes the alert more trustworthy.
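For reference, here is a minimal sketch of what that monitor could look like when declared with the Datadog Terraform provider. The resource and monitor names, the notification handle in the message, and the reference to the SLO resource from the earlier sketch are all placeholders, not DotMoney's actual configuration.

resource "datadog_monitor" "home_burn_rate" {
  name    = "[SLO] Home page error budget burn rate"
  type    = "slo alert"
  message = "The error budget is burning fast. @slack-placeholder-channel"

  # Alert when the burn rate exceeds 14.4 over both the 1h long window and the 5m short window.
  query = "burn_rate(\"${datadog_service_level_objective.home_latency.id}\").over(\"30d\").long_window(\"1h\").short_window(\"5m\") > 14.4"

  monitor_thresholds {
    critical = 14.4
  }
}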
 

Conclusion


This article is like an account of my experience as an SRE beginner and how I introduced SRE to my service.
Introducing SRE is a high hurdle, but I think it's important to just give it a try.
Of course, SRE is both a culture and an organization, so it's not a case of just introducing it and then it's over; the story is likely to continue.
I don't fully understand SRE myself, but I think the quickest way to gain knowledge about SRE is to introduce it without thinking too deeply about it at first and improve it day by day. (It's hard to see the future and there's a lot of mundane work...)
What I want to do next is visualize the link between SRE and business impact. For example, I want to find a way to visualize the difference between revenue when the error budget is exhausted and goes negative, and revenue when the error budget is fully used (or has a surplus), and turn the impact SRE has on the business from something vague into something concrete.
If anyone has already done this please let me know.
SRG is looking for people to work with us. If you are interested, please contact us here.