Impressions from performing a mob cost analysis and a review of cost reduction measures - How an SRE approaches cost management -
This is my first time contributing to this blog. My name is Nakajima.
Until recently, I worked in the Service Reliability Group (SRG) of the Media Headquarters, and I now work mainly as an SRE in the Games and Entertainment Division (SGE).
If you like, please also see "ElastiCache Verification: Updated Serverless Implementation Points with Valkey and Valkey 8.0".
Today's post is a relaxed look back at the Mob Cost Analysis meetings that were a hot topic this year, focusing mainly on how SREs should approach cost reduction.
- Fun Mob Cost Analysis
- As an SRE, I will analyze the reasons why something could not be done.
- Blocking factors in cost reduction
- Simple principles for SRE cost management
- "Even if we are told to cut costs, we do it all the time, so there is nothing we can cut right away."
- Conclusion
Fun Mob Cost Analysis
Mob Cost Analysis is a concept proposed by DELTA Co., Ltd. in a presentation at the AWS Cost Reduction Tenkaichi Budokai (I don't think I had heard of it before that).
In short, it is a meeting where everyone looks at the system's dashboard and discusses costs one way or another.
Cost-related work is often done individually when a problem becomes apparent, so I was very interested in the approach of making it a team-wide effort, and immediately tried putting it into practice in several projects.
Trying it out, I found that the cost reduction work also helped participants gain a deeper understanding of the system architecture, and I think it was a very meaningful initiative. The areas where costs inevitably grow were discussed in depth from the business, back-end, and infrastructure perspectives, which was very satisfying for me as an engineer.
As an SRE, I will analyze the reasons why something could not be done.
In a way, it is a lot of fun to find areas where costs have not yet been reduced through mob cost analysis meetings and then work on them.
I took this opportunity to go one step further and analyze in more detail what actually happens in cost reduction work.
- Before the mob cost analysis meeting, identify and list the tasks that have already been completed.
- Combine the tasks that have already been done with those that have not yet been done, and classify them as follows:
  - Categorize task types
    - Architecture changes
    - Overprovisioning
    - Removing abandoned resources
    - Others
  - Analyze the blockers that prevented tasks from being completed
    - I didn't notice
    - Poor cost-effectiveness for the labor hours / low task priority
    - Coordination is tedious
    - etc.
We asked each department to compile this information and visualized how each task was being carried out.
I will not reveal the details as it contains information about multiple projects, but it looks something like this:
Figure 1) Sample analysis of work that has already been done

Figure 2) Sample analysis of tasks that were not done
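
The tallies behind figures like these can be produced with very little code. The following is only a minimal sketch of the idea: the `CostTask` record, the category and blocker labels, and the sample tasks are all hypothetical and are not the data or tooling we actually used.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostTask:
    """One cost reduction task (hypothetical record layout)."""
    name: str
    task_type: str                 # e.g. "overprovisioning", "abandoned_resource"
    done: bool
    blocker: Optional[str] = None  # only meaningful when done is False


def tally_by_type(tasks, done):
    """Count tasks per task type, for either the done or the not-done group."""
    return Counter(t.task_type for t in tasks if t.done == done)


def tally_blockers(tasks):
    """Count the blockers recorded on tasks that were not done."""
    return Counter(t.blocker or "unknown" for t in tasks if not t.done)


# Hypothetical sample data, for illustration only.
tasks = [
    CostTask("Shrink staging DB instance", "overprovisioning", True),
    CostTask("Delete unused snapshots", "abandoned_resource", True),
    CostTask("Right-size batch workers", "overprovisioning", False, "coordination_is_tedious"),
    CostTask("Remove old test environment", "abandoned_resource", False, "did_not_notice"),
    CostTask("Move cache to serverless", "architecture_change", False, "low_priority"),
]

print("done:", tally_by_type(tasks, done=True))
print("not done:", tally_by_type(tasks, done=False))
print("blockers:", tally_blockers(tasks))
```

Keeping the blocker next to each task that was not done is what makes the second kind of chart possible, so it is worth recording even when the reason feels obvious.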

This is not a group-wide initiative (across all of SRG, for example). At this stage, it only covers several projects that I have personally been involved in.
It became clear that both the work that had been done and the work that had not consisted mainly of:
- Cleaning up overprovisioning
- Cleaning up abandoned resources
Blocking factors in cost reduction
The blocking factors are essentially the reasons why a task could not be started right away. Apart from simply not having noticed the problem, I think there are various reasons you can imagine, depending on the project.
- Coordination is tedious
  - The coordination alone is disproportionately hard compared with the difficulty and importance of the work itself.
  - This is most noticeable in test and internal environments.
  - The resource should have been easy to scale down or remove, but as the number of people involved grew, it became hard to do so casually.
- The required bar is high
  - Performance buffers, completeness of validation, and so on. Most noticeable in production.
- Low cost-effectiveness per labor hour
  - The absolute amount of the reduction is small.
  - The required labor hours are too large compared with the labor hours available at the time, and so on.
- The task is given low priority
  - It is not what the project is focused on right now.
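
The "low cost-effectiveness per labor hour" blocker can be made concrete with a rough payback calculation. This is a minimal sketch, not a method we formally use: the function names, the hourly labor cost, and the six-month cutoff are all assumptions chosen for illustration.

```python
def payback_months(monthly_saving, labor_hours, hourly_labor_cost):
    """Months until the cumulative saving covers the one-off labor cost."""
    if monthly_saving <= 0:
        return float("inf")  # a task that saves nothing never pays back
    return (labor_hours * hourly_labor_cost) / monthly_saving


def worth_doing(monthly_saving, labor_hours, hourly_labor_cost, max_payback_months=6.0):
    """True when the task pays for itself within the assumed cutoff."""
    return payback_months(monthly_saving, labor_hours, hourly_labor_cost) <= max_payback_months


# Hypothetical numbers: a small saving that needs two days of work,
# and a large saving that needs one day of work.
print(payback_months(monthly_saving=50, labor_hours=16, hourly_labor_cost=60))
print(worth_doing(monthly_saving=50, labor_hours=16, hourly_labor_cost=60))
print(worth_doing(monthly_saving=800, labor_hours=8, hourly_labor_cost=60))
```

Note that coordination time belongs in `labor_hours` as well. In my experience that is often what turns an apparently cheap task into one that never pays back.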
Either way, it is easy to imagine that an individual would need a considerable amount of energy to break through these blockers alone.
That is why cost-cutting projects tend to get underway only when a top-down order to cut costs is given.
Simple principles for SRE cost management
So, based on the above analysis, how should you approach project cost management during peacetime?
The answer seems quite simple once you consider two things: overprovisioning and abandoned resources are the main sources of wasted cost, and changing something that is already running takes a certain amount of effort, however much you want to improve it.
- Don’t overprovision in the first place
The conclusion is that if a large share of the improvement work comes from overprovisioning, we should make a strong effort to prevent overprovisioning from occurring in the first place.
If the work never arises, there is no coordination to do, no "I'll do it later" turning into "I won't (or can't) do it later", and no case where the cost stays as it is because the saving is smaller than the labor hours required.
Cost reduction is not something to do all at once. By putting in a little extra effort during normal times to get things right as you go, you can reduce the amount of cost reduction work that arises later.
This is exactly like technical debt, except that if you use the cloud carelessly today, it is not a debt but a direct cash outflow.
The analysis here focuses on overprovisioning, but I think the same idea includes using specialized knowledge to design the system appropriately at an early stage, so that you avoid configurations that would later require architectural changes.
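
As one way to picture "not overprovisioning in the first place", a routine check can flag resources whose peak utilization stays well below capacity. This is a minimal sketch under stated assumptions: the `ResourceUsage` record, the 40% threshold, and the sample values are hypothetical, and in practice the peaks would come from your monitoring system over a period long enough to include real load spikes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUsage:
    """Observed peak utilization of one resource (hypothetical record layout)."""
    name: str
    peak_cpu_percent: float
    peak_memory_percent: float


def find_overprovisioned(resources, peak_threshold_percent=40.0):
    """Return names of resources whose CPU and memory peaks are both below the threshold."""
    return [
        r.name
        for r in resources
        if r.peak_cpu_percent < peak_threshold_percent
        and r.peak_memory_percent < peak_threshold_percent
    ]


# Hypothetical sample data, for illustration only.
resources = [
    ResourceUsage("api-server", peak_cpu_percent=72.0, peak_memory_percent=65.0),
    ResourceUsage("staging-db", peak_cpu_percent=8.0, peak_memory_percent=22.0),
    ResourceUsage("batch-worker", peak_cpu_percent=35.0, peak_memory_percent=81.0),
]

print(find_overprovisioned(resources))
```

The check looks at peaks rather than averages on purpose, since a production buffer is sized for the worst case. A resource is only flagged when both peaks are low, so a memory-bound worker with an idle CPU is left alone.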
In addition, the following elements will need to be supported:
- Regular review efforts
  - The basic countermeasure against abandoned resources
  - Dealing with changing assumptions
  - Don't leave specs that were meant to be temporary in place
  - Check the status after handing over the system
  - Don't let a decision to solve a problem by spending money stand forever
  - To catch things that are overlooked or go unnoticed, have multiple people check
- Cultivating a culture of proactive cost optimization (team + project)
  - Being able to act without being told to watch costs
  - Aim to move away from relying on experts (visualization, democratization)
  - Spread the use of cost analysis tools (Cost Explorer for AWS)
  - A culture where project teams always check resource utilization
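
For teams on AWS, one small step toward democratizing cost data is a script that anyone can run before a review meeting. The sketch below uses the boto3 Cost Explorer client's `get_cost_and_usage` call to list the most expensive services for a period. The helper names are my own, it assumes boto3 is installed and credentials with Cost Explorer access are configured, and it ignores pagination, so treat it as a starting point rather than a finished tool.

```python
from collections import defaultdict


def fetch_cost_by_service(start, end):
    """Fetch monthly unblended cost grouped by service.

    start and end are "YYYY-MM-DD" strings; the end date is exclusive.
    Requires boto3 and AWS credentials with Cost Explorer access.
    """
    import boto3  # imported lazily so top_services() works without boto3

    client = boto3.client("ce")
    return client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


def top_services(response, n=5):
    """Sum the cost per service across all periods and return the n largest."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Usage (the dates are placeholders):
# response = fetch_cost_by_service("2025-01-01", "2025-03-01")
# for service, amount in top_services(response):
#     print(f"{service}: {amount:.2f}")
```

Splitting the API call from the summarizing step keeps `top_services` easy to test with a saved response, and makes it easier to change the grouping later, for example to a cost allocation tag per project.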
I believe that holding mob cost analysis meetings regularly is a simple measure that can go a long way toward implementing these regular reviews and fostering this culture.
"Even if we are told to cut costs, we do it all the time, so there is nothing we can cut right away."
As a result of this analysis, my aim for the projects I am in charge of is to be able to say that, even if I am asked to reduce costs, there is nothing to cut right away because we do it all the time.
There are many things SREs must do, from system design to SLIs/SLOs, but one goal worth setting is for SREs to continually operate a lean, cost-efficient system.
We would like to foster a culture in which projects and teams are also evaluated on their cost management in peacetime, so that SREs can be proud of the contributions they make.
Conclusion
When it comes to reducing infrastructure costs, the places that were most wasteful are the ones where cuts produce visible results, so it is not surprising that there are so many announcements of that kind.
I would be happy to one day present the results of operating under the principle that "even if you say we need to cut costs, we don't need to cut anything right away because we do it all the time." It would also be interesting to see presentations that summarize such efforts.
What would the results be if you analyzed the cost reduction measures and their blocking factors in your own organization? I think they would change your guidelines for how to act efficiently. If you are interested in learning more, please contact Nakajima or any member of SRG. Let's exchange information.
Thank you for reading this far.