SRE reliability measurement
This article is the Day 1 entry in the CyberAgent Group SRE Advent Calendar 2024. Since it's the first day, I've written it in a relaxed style.
Contents
- Introduction
- Looking back on 2024
- SRE reliability measurement
- Measuring Capabilities
- Trend analysis and improvement
- Conclusion
Introduction
SRG (Service Reliability Group) is a group that mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
Looking back on 2024
This year, beyond my main job, I served as a Co-Chair for SRE NEXT 2024, and I think it was a year in which I was able to contribute to the SRE community. To be honest, it was incredibly busy, but in the end we held an event with over 1,500 participants and a 96.1% satisfaction rate, so I'm glad I did it.
Archived video of SRE NEXT 2024: all of the footage, including the on-site panel sessions, has been uploaded to YouTube, so I hope you will take a look.
In my main job, I commercialized a cross-functional SRE organization in 2020 and was able to expand SRE support beyond the media business to the SGE (game) division and to external parties. If I have the opportunity, I would like to share how I went about this commercialization, along with the SRE lifecycle that I briefly touched on in the panel session at SRE NEXT, in another blog post or a presentation.
In this article, I would like to write a little about measuring SRE reliability.
SRE reliability measurement
When you run a cross-functional SRE organization, you provide SRE support to multiple products, but organizational resource constraints make it unrealistic to embed improvements in every product. We have therefore developed two approaches, SRE maturity and SRE reliability measurement, that take a bird's-eye view of the entire business division and turn it into data, making it easier to set business priorities as we work to improve SRE.
Regarding SRE maturity, please see the presentation materials from the CyberAgent Developer Conference.
SRE maturity and SRE reliability measurement differ in their objectives, targets, and questions. SRE reliability measurement is aimed at executives: its objective is to understand system reliability and risk within each executive's area of responsibility. We also aim to roll it out to all key products.

Measuring Capabilities
SRE reliability measurement assesses capabilities in four categories:
- Security
- Capacity Planning
- Availability
- Operational Optimization
The questions focus on high-risk items. Although there are some exceptions depending on the conditions, every item is treated as essential, and we are promoting their adoption in our key products.
Each item (only some are excerpted here) is answered Yes or No, and with each answer we also collect the documents and other deliverables that serve as the basis for the evaluation.
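To make the Yes/No format concrete, here is a minimal sketch in Python. Only the four category names come from this article. The `Answer` record, the sample questions, the evidence labels, and the idea of summarizing each category as the fraction of Yes answers are all assumptions for illustration, not the actual measurement sheet or scoring method.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four categories come from the article; everything else is hypothetical.
CATEGORIES = ("Security", "Capacity Planning", "Availability", "Operational Optimization")


@dataclass(frozen=True)
class Answer:
    category: str  # one of CATEGORIES
    item: str      # hypothetical question text
    yes: bool      # the Yes/No answer
    evidence: str  # title or link of the supporting deliverable ("" if none)


def yes_rate_by_category(answers):
    """Return {category: fraction of items answered Yes}, skipping empty categories."""
    totals = defaultdict(int)
    yeses = defaultdict(int)
    for a in answers:
        if a.category not in CATEGORIES:
            raise ValueError(f"unknown category: {a.category}")
        totals[a.category] += 1
        yeses[a.category] += a.yes
    return {c: yeses[c] / totals[c] for c in CATEGORIES if totals[c]}


def missing_evidence(answers):
    """Return items answered Yes that have no supporting document attached."""
    return [a.item for a in answers if a.yes and not a.evidence.strip()]


# Hypothetical sample answers for one product.
answers = [
    Answer("Security", "Access keys are rotated", True, "runbook-001"),
    Answer("Security", "Vulnerability scans run on every release", False, ""),
    Answer("Availability", "SLOs are defined", True, ""),
    Answer("Capacity Planning", "Load tests run before major launches", True, "report-2024"),
]

print(yes_rate_by_category(answers))
print(missing_evidence(answers))
```

Keeping the evidence next to each answer means a Yes without a supporting deliverable can be flagged for follow-up rather than taken at face value.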

Trend analysis and improvement
For governance reasons, we prepare a separate measurement Google Sheet for each product, use Google Apps Script to aggregate the data into an analysis Google Sheet, and then use Looker Studio to analyze trends by business division and item.
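The pipeline itself runs on Google Apps Script and Looker Studio. As a rough illustration of the data shaping only, the sketch below uses plain Python: it flattens per-product answers into one long-format table and computes the share of No answers per division and item. The product names, division names, questions, and the No-rate metric are hypothetical and are not taken from the actual sheets.

```python
from collections import defaultdict

# Hypothetical per-product sheets: {product: (division, {item: "Yes"/"No"})}.
product_sheets = {
    "product-a": ("Media", {"SLOs are defined": "Yes", "Backups are restore-tested": "No"}),
    "product-b": ("Media", {"SLOs are defined": "No", "Backups are restore-tested": "No"}),
    "product-c": ("Games", {"SLOs are defined": "Yes", "Backups are restore-tested": "Yes"}),
}


def aggregate(sheets):
    """Flatten per-product sheets into long-format rows for one analysis table."""
    rows = []
    for product, (division, answers) in sorted(sheets.items()):
        for item, answer in answers.items():
            rows.append({
                "division": division,
                "product": product,
                "item": item,
                "yes": answer.strip().lower() == "yes",
            })
    return rows


def no_rate(rows):
    """Return {(division, item): fraction of products that answered No}."""
    totals = defaultdict(int)
    nos = defaultdict(int)
    for r in rows:
        key = (r["division"], r["item"])
        totals[key] += 1
        nos[key] += not r["yes"]
    return {k: nos[k] / totals[k] for k in totals}


rows = aggregate(product_sheets)
# Highest No rate first: candidates for division-wide improvement.
for (division, item), rate in sorted(no_rate(rows).items(), key=lambda kv: -kv[1]):
    print(f"{division:6} {item:28} {rate:.0%}")
```

One row per product and item is a convenient shape for a BI tool, because the same table can be grouped by division, by item, or by both without reshaping.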

Much of the information cannot be made public, so this may be hard to follow, but we analyze the collected data in the following manner to help us make improvements efficiently.

Conclusion
Running SRE reliability measurement made us realize that even when the technology stack and system design are the same, differences in organizational culture and development structure mean the evaluations do not necessarily come out the same. The measurement has also helped SRE culture take root at the company level, and its results can serve as a reference when making technology investment decisions.
Although I won't go into detail in this article, SRG is developing and deploying SRE practices that will improve the reliability and resilience of business divisions and products based on data obtained from SRE reliability measurements. We also plan to make our SRE practices publicly available in the future.
Once again, I hope that this Advent Calendar will help many people learn about our group's efforts. Personally, I'm looking forward to it.
SRG is looking for people to work with us.
If you're interested, please contact us here.