About SRE Reliability Measurement
This article is the Day 1 entry in the CyberAgent Group SRE Advent Calendar 2024. Since it's the first day, I'm writing it in a relaxed style.
- Introduction
- Looking back at 2024
- SRE Reliability Measurement
- Measuring Capabilities
- Trend Analysis and Improvement
- Conclusion
Introduction
#SRG (Service Reliability Group) is a group that mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
Looking back at 2024
This year, in addition to my main job, I served as a co-chair of SRE NEXT 2024, and I think it was a year in which I was able to contribute to the SRE community. To be honest, it was an incredibly busy year, but in the end the event had over 1,500 participants and a 96.1% satisfaction rate, so I'm glad I did it. All of the footage from SRE NEXT 2024, including the on-site panel sessions, will be uploaded to YouTube as archived video, so I hope you will take a look.
In my main job, I turned a cross-functional SRE organization into a business in 2020, and was able to expand SRE support beyond the media business to SGE (our game division) and to customers outside the company. I would like to share what I did to commercialize it, along with the SRE life cycle that I briefly touched on in the panel session at SRE NEXT, in another blog post or talk on another occasion.
In this article, I would like to write a little about measuring SRE reliability.
SRE Reliability Measurement
When you run a cross-functional SRE organization, you provide SRE support to multiple products, but with limited organizational resources it is not realistic to embed in every product and drive improvements there. We therefore developed two approaches, SRE maturity and SRE reliability measurement, that take a bird's-eye view of an entire business division and quantify it. This makes it easier to decide priorities as a business, and we use both to drive SRE improvements.
For SRE maturity, please take a look at the presentation materials from the CyberAgent Developer Conference.
SRE maturity and SRE reliability measurement differ in purpose, target, and questions, as shown below. SRE reliability measurement is aimed at executives: its purpose is to understand system reliability and risk within each executive's jurisdiction. As a rule, we aim to introduce it to all focus products.

Measuring Capabilities
SRE reliability measurement assesses capabilities in four categories:
- Safety features
- Capacity Planning
- Availability
- Operational Optimization
The questions focus on high-risk items. Although there are some exceptions depending on the conditions, every item is essential, and we are promoting their adoption in our focus products.
Respondents answer the items below (an excerpt) with Yes/No, and when they answer we also collect documents and other deliverables as evidence for the evaluation.
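The real questionnaire items are not reproduced here. As a rough illustration of the answer format just described, the following TypeScript sketch models a Yes/No answer with attached evidence and counts Yes answers per category. The type names, field names, and the "Yes without evidence" check are all assumptions made for illustration, and the items are placeholders, not the actual questions.

```typescript
// Hypothetical model of one Yes/No answer; all names here are assumptions.
type Category =
  | "Safety features"
  | "Capacity Planning"
  | "Availability"
  | "Operational Optimization";

interface Answer {
  category: Category;
  item: string;       // question identifier (placeholder values below)
  yes: boolean;       // the Yes/No answer
  evidence: string[]; // titles or links of documents submitted as evidence
}

interface CategorySummary {
  yes: number;
  total: number;
  missingEvidence: string[]; // items answered Yes with no evidence attached
}

// Count Yes answers per category and flag Yes answers that lack evidence.
function summarize(answers: Answer[]): Map<Category, CategorySummary> {
  const result = new Map<Category, CategorySummary>();
  for (const a of answers) {
    const s = result.get(a.category) ?? { yes: 0, total: 0, missingEvidence: [] };
    s.total += 1;
    if (a.yes) {
      s.yes += 1;
      if (a.evidence.length === 0) s.missingEvidence.push(a.item);
    }
    result.set(a.category, s);
  }
  return result;
}

// Placeholder answers, not the real questionnaire.
const sample: Answer[] = [
  { category: "Availability", item: "item-1", yes: true, evidence: ["doc-a"] },
  { category: "Availability", item: "item-2", yes: true, evidence: [] },
  { category: "Capacity Planning", item: "item-3", yes: false, evidence: [] },
];

const summary = summarize(sample);
console.log(summary.get("Availability"));
```

Keeping the evidence next to each answer makes it possible to review a "Yes" later, which seems to be the point of collecting deliverables alongside the answers.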

Trend Analysis and Improvement
For governance reasons, we prepare a separate measurement Google Sheet for each product. We use Google Apps Script to aggregate the data from those sheets into an analysis Google Sheet, and then use Looker Studio to analyze trends by business division and by item.
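The aggregation step might look roughly like the sketch below. It is written as plain TypeScript over 2D arrays so that it runs outside Apps Script; in Apps Script the rows of each product sheet would typically come from something like `getDataRange().getValues()`. The sheet layout (a header row followed by category, item, answer columns), the division and product names, and the function names are all assumptions, not the actual SRG implementation.

```typescript
// One measurement sheet per product, as a 2D array of cell values.
// Assumed layout (hypothetical): header row, then [category, item, answer].
interface ProductSheet {
  division: string;
  product: string;
  rows: string[][];
}

// One row of the analysis sheet in "long" format, which is easy to chart.
interface AnalysisRow {
  division: string;
  product: string;
  category: string;
  item: string;
  yes: boolean;
}

// Flatten every product sheet into analysis rows, skipping the header row.
function flatten(sheets: ProductSheet[]): AnalysisRow[] {
  const out: AnalysisRow[] = [];
  for (const sheet of sheets) {
    for (const [category, item, answer] of sheet.rows.slice(1)) {
      out.push({
        division: sheet.division,
        product: sheet.product,
        category,
        item,
        yes: answer.trim().toLowerCase() === "yes",
      });
    }
  }
  return out;
}

// Share of Yes answers per "division/item" key.
function yesRateByDivisionAndItem(rows: AnalysisRow[]): Map<string, number> {
  const counts = new Map<string, { yes: number; total: number }>();
  for (const r of rows) {
    const key = `${r.division}/${r.item}`;
    const c = counts.get(key) ?? { yes: 0, total: 0 };
    c.total += 1;
    if (r.yes) c.yes += 1;
    counts.set(key, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, key) => rates.set(key, c.yes / c.total));
  return rates;
}

// Placeholder data, not real measurement results.
const header = ["category", "item", "answer"];
const sheets: ProductSheet[] = [
  { division: "media", product: "product-a", rows: [header, ["Availability", "item-1", "Yes"], ["Capacity Planning", "item-2", "No"]] },
  { division: "media", product: "product-b", rows: [header, ["Availability", "item-1", "No"], ["Capacity Planning", "item-2", "No"]] },
  { division: "game", product: "product-c", rows: [header, ["Availability", "item-1", "Yes"]] },
];

const analysisRows = flatten(sheets);
const rates = yesRateByDivisionAndItem(analysisRows);
console.log(rates.get("media/item-1"));
```

A long-format table with one row per product and item is a common choice for this kind of aggregation, because a BI tool can then group by division or by item without further reshaping.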

Much of the information cannot be made public, so this may be hard to follow, but we analyze the collected data in the manner shown below and use it to make improvements efficiently.

Conclusion
What we learned from running SRE reliability measurement is that even when the technology stack and system design are the same, the evaluation is not necessarily the same, because organizational culture and development structure differ. The measurement also produced results such as spreading SRE culture at the company level and serving as a reference for technology investment decisions.
Although I won't go into it in this article, SRG is developing and rolling out SRE practices, based on the data obtained from SRE reliability measurement, that improve the reliability and resilience of business divisions and products. We plan to make these SRE practices publicly available in the future.
Once again, I hope that this Advent Calendar will help many people learn about our group's efforts. Personally, I am looking forward to it.
SRG is looking for people to work with us. If you are interested, please contact us here.