Impressions from Mob Cost Analysis and a Review of Cost Reduction Measures - How SREs Approach Cost Management -
This is my first time contributing to this blog. My name is Nakajima.
Until recently I worked in the Service Reliability Group (SRG) of the Media Headquarters, and I now work mainly as an SRE in the Games and Entertainment Division (SGE).
If you'd like, please also see "ElastiCache Review: Updated Serverless Implementation Points and Valkey 8.0".
Today, I would like you to read this as a relaxed talk that looks back at the Mob Cost Analysis Meeting, a hot topic this year, and mainly covers how SREs should approach cost reduction.
- Fun Mob Cost Analysis
- As an SRE, I will analyze the reasons why something wasn't done.
- Blocking factors in cost reduction
- Simple principles for cost control for SRE
- "Even if we are told to cut costs, we do it all the time, so there is nothing we can cut right away."
- Conclusion
Fun Mob Cost Analysis
Mob Cost Analysis is a concept proposed by DELTA Co., Ltd. in a presentation at the AWS Cost Reduction Tenkaichi Budokai (I don't think I'd heard of it before then).
In short, it is a meeting where everyone looks at the system's dashboard and discusses costs.
Cost-related work is often done individually when something becomes apparent, so I was very interested in the approach of making it a team-wide effort, and I immediately tried it out on several projects.
As a result of trying it out, the cost reduction efforts also gave the participating members a deeper understanding of the system architecture, and I thought it was a very meaningful initiative. Inevitably, the parts where costs are high get discussed in depth from the business, back-end, and infrastructure perspectives, which is very satisfying for me as an engineer.
As an SRE, I will analyze the reasons why something wasn't done.
Well, in a way, it's a very enjoyable task to find areas where costs have not been reduced through mob cost analysis meetings and then work to solve them.
I thought I would take this opportunity to take a step further and explain in more detail what cost reductions actually involve.
- Before the mob cost analysis meeting, identify and list the tasks that have already been completed.
- Combine the tasks that have already been done and the tasks that have not yet been done and classify them as follows:
  - Categorize task types
    - Architecture changes
    - Overprovisioning
    - Deleting abandoned resources
    - Others
  - Analyze the blockers that kept the task from being done
    - I didn't notice
    - Poor cost-effectiveness for man-hours / low task priority
    - Adjustment (coordination with stakeholders) is tedious
    - etc.
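As a rough illustration of the classification above, here is a minimal Python sketch that tags each task with a type and, for tasks not yet done, a blocker, and then tallies them. The class names, the `summarize` helper, and the sample tasks are all hypothetical; they are not taken from the actual analysis.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskType(Enum):
    ARCHITECTURE_CHANGE = "Architecture changes"
    OVERPROVISIONING = "Overprovisioning"
    ABANDONED_RESOURCE = "Deleting abandoned resources"
    OTHER = "Others"


class Blocker(Enum):
    NOT_NOTICED = "I didn't notice"
    LOW_ROI_OR_PRIORITY = "Poor cost-effectiveness / low priority"
    TEDIOUS_ADJUSTMENT = "Adjustment is tedious"
    OTHER = "etc."


@dataclass(frozen=True)
class CostTask:
    name: str                          # hypothetical task label
    task_type: TaskType
    done: bool
    blocker: Optional[Blocker] = None  # only meaningful when done is False


def summarize(tasks):
    """Count done tasks by type, and not-yet-done tasks by type and by blocker."""
    done_by_type = Counter(t.task_type for t in tasks if t.done)
    todo_by_type = Counter(t.task_type for t in tasks if not t.done)
    todo_by_blocker = Counter(
        t.blocker for t in tasks if not t.done and t.blocker is not None
    )
    return done_by_type, todo_by_type, todo_by_blocker


# Made-up sample data, purely for illustration.
SAMPLE_TASKS = [
    CostTask("shrink staging DB instance", TaskType.OVERPROVISIONING, done=True),
    CostTask("delete unused snapshots", TaskType.ABANDONED_RESOURCE, done=True),
    CostTask("downsize internal tool servers", TaskType.OVERPROVISIONING,
             done=False, blocker=Blocker.TEDIOUS_ADJUSTMENT),
    CostTask("remove orphaned load balancer", TaskType.ABANDONED_RESOURCE,
             done=False, blocker=Blocker.NOT_NOTICED),
    CostTask("move batch jobs to a cheaper design", TaskType.ARCHITECTURE_CHANGE,
             done=False, blocker=Blocker.LOW_ROI_OR_PRIORITY),
]

if __name__ == "__main__":
    done, todo, blockers = summarize(SAMPLE_TASKS)
    for label, counter in (("done", done), ("not done", todo), ("blocker", blockers)):
        for key, count in counter.most_common():
            print(f"{label:>8} | {key.value}: {count}")
```

Keeping the type and the blocker as separate fields makes it possible to ask both "what kind of work is left?" and "why is it left?" from the same list.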
We asked each department to compile this information and visualized how each task was being carried out.
I will not reveal the details as it contains information from multiple projects, but it looks something like this:
Figure 1) Sample analysis of work that has already been done

Figure 2) Sample analysis of tasks that were not being done

This analysis has not been carried out across an entire group such as SRG; at this stage it only covers the multiple projects that I have personally been involved in.
Both the work that had been done and the work that had not been done turned out to consist mainly of:
- Cleaning up overprovisioning
- Cleaning up abandoned resources
Blocking factors in cost reduction
The blocking factors are essentially the reasons why work was not started right away. Apart from cases where the waste simply went unnoticed, I think the reasons vary by project in ways you can probably imagine.
- Adjustment (coordination with stakeholders) is tedious
  - The coordination alone is disproportionately hard compared to the difficulty and importance of the work itself.
  - This is especially noticeable in testing environments and internal environments.
  - The resource should have been easy to shut down, but as the number of people involved increased, it became difficult to touch it casually.
- High hurdles required
  - Performance buffers, perfect validation, etc. Mainly noticeable in production.
- Low cost-effectiveness per man-hour
  - The absolute amount of reduction is low
  - The required man-hours are too large compared to the man-hours available at that time, etc.
- Tasks are given low priority
  - The focus of the project is simply elsewhere right now
In any case, it's easy to imagine that it would take a considerable amount of energy for an individual engineer to break through these blocking factors on their own.
That's why we could also say that cost-cutting projects tend to get moving when a top-down call for cost reduction is made.
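The "low cost-effectiveness per man-hour" blocker can be made concrete with a simple payback calculation. The sketch below is only an illustration: the function names, the hourly rate, the six-month horizon, and the figures are all assumptions, not values from the projects analyzed.

```python
def payback_months(monthly_saving, engineer_hours, hourly_rate):
    """Months until the saving covers the one-off labour cost.

    Returns float("inf") when there is no saving at all.
    """
    if monthly_saving <= 0:
        return float("inf")
    return (engineer_hours * hourly_rate) / monthly_saving


def worth_doing(monthly_saving, engineer_hours, hourly_rate, max_payback_months=6.0):
    """True when the task pays for itself within the chosen horizon (an assumption)."""
    return payback_months(monthly_saving, engineer_hours, hourly_rate) <= max_payback_months


# Made-up figures in an arbitrary currency unit.
HOURLY_RATE = 50.0

# Small saving, but the hours balloon because many people must be consulted.
small_task = payback_months(monthly_saving=100.0, engineer_hours=40, hourly_rate=HOURLY_RATE)

# Larger saving for little work.
large_task = payback_months(monthly_saving=2000.0, engineer_hours=8, hourly_rate=HOURLY_RATE)

if __name__ == "__main__":
    print(f"small task pays back in {small_task:.1f} months")
    print(f"large task pays back in {large_task:.1f} months")
```

With these made-up numbers the small task takes 20 months to pay back while the large one takes well under a month, which is the kind of gap that leaves small clean-ups perpetually undone unless someone pushes for them.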
Simple principles for cost control for SRE
So, based on the above analysis, how should you approach project cost management during normal times?
It becomes quite simple once you consider two things: overprovisioning and idle resources are the main sources of wasted cost, and changing something that is already running takes a certain amount of effort even when you want to improve it.
- Don't overprovision in the first place
If a large portion of the improvement work stems from overprovisioning, then we can conclude that we should make a strong effort to prevent overprovisioning from occurring in the first place.
If that work never arises, there is nothing to coordinate, "I'll do it later" never turns into "I won't (can't) do it later," and costs are never left as they are because the saving is smaller than the man-hours required.
Cost reduction is not something that can be done all at once; by putting in a little extra effort during normal times and doing the right things over and over again, you can reduce the amount of cost reduction work that will have to be done later.
This is exactly like technical debt. In today's world, if you use the cloud carelessly, it's not a debt, but a direct cash outflow.
The analysis above focuses on overprovisioning, but I think the same principle also covers avoiding configurations that would later require architectural changes, by applying specialized knowledge and designing the system appropriately in the early stages.
Additionally, the following elements will need to be supported:
- Regular review efforts
  - Basics of measures against abandoned resources
  - Dealing with changing assumptions
    - Don't leave specs that were raised temporarily in place
    - Check the status of the system after it has been handed over
    - Don't let a decision to solve a problem by throwing money at it stay in place forever
  - To prevent oversights and things going unnoticed, it is best to have multiple people check
- Fostering a culture of proactive cost optimization (Team + Project)
  - Being able to act without being told to worry about costs
  - Aim to move away from leaving it up to experts (visualization, democratization)
  - Broaden who can view cost analysis tools (Cost Explorer for AWS)
  - A culture where project teams constantly check resource utilization
In implementing these regular reviews and fostering a culture of doing so, I believe that holding regular initiatives such as mob cost analysis meetings can be a simple measure that can have a significant effect.
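As one way to prepare for such a regular review, the sketch below sorts resources into idle, overprovisioned, and fine based on utilization samples, so that the meeting can start from a short candidate list. It is a minimal sketch under stated assumptions: `ResourceUsage`, `review`, the thresholds, and the sample data are all hypothetical, and in practice the utilization figures would come from your own monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceUsage:
    name: str                    # hypothetical resource identifier
    monthly_cost: float          # arbitrary currency unit
    cpu_percent: List[float]     # utilization samples over the review window


def review(resources, idle_peak=1.0, overprovisioned_peak=40.0):
    """Split resources into idle / overprovisioned / ok by peak utilization.

    Peak (not average) is used so that short bursts are not mistaken for headroom.
    Both thresholds are illustrative assumptions, not recommendations.
    """
    findings = {"idle": [], "overprovisioned": [], "ok": []}
    for r in resources:
        peak = max(r.cpu_percent, default=0.0)
        if peak <= idle_peak:
            findings["idle"].append(r)
        elif peak < overprovisioned_peak:
            findings["overprovisioned"].append(r)
        else:
            findings["ok"].append(r)
    # Most expensive candidates first, so the discussion starts with the biggest items.
    for group in findings.values():
        group.sort(key=lambda r: r.monthly_cost, reverse=True)
    return findings


# Made-up sample data, purely for illustration.
SAMPLE = [
    ResourceUsage("api-prod", 900.0, [35.0, 62.0, 71.0]),
    ResourceUsage("batch-stg", 400.0, [3.0, 12.0, 8.0]),
    ResourceUsage("old-test-db", 150.0, [0.0, 0.4, 0.1]),
    ResourceUsage("admin-tool", 600.0, [5.0, 20.0, 9.0]),
]

if __name__ == "__main__":
    for group, items in review(SAMPLE).items():
        print(group, [(r.name, r.monthly_cost) for r in items])
```

A list like this is only a starting point for discussion; whether a flagged resource can actually be shrunk or removed still depends on the performance buffers and coordination described earlier.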
"Even if we are told to cut costs, we do it all the time, so there is nothing we can cut right away."
As a result of the above analysis, I have come to keep the following in mind for the projects I am in charge of: even if I am told to cut costs, there should be nothing I can cut right away, because I am already doing it on a regular basis.
There are various things that SREs must do, from system design to SLI/SLO, but continuously operating a lean, cost-efficient system is also a good goal for SREs.
We would like to foster a culture where projects and teams can be evaluated in terms of cost management even in peacetime, so that SREs can be proud to say that they are making a contribution.
Conclusion
When it comes to reducing infrastructure costs, the more wasteful an environment was, the larger the savings and the more impressive the results tend to look, so it's no wonder that so many presentations are of that kind.
I would be happy to be able to present the results of operations based on the principle that "even if we are told to cut costs, there is nothing we can cut right away because we are already doing it on a regular basis." It would also be interesting to hold a presentation event that brings such efforts together.
What would be the results of analyzing cost reduction measures and their blocking factors in your organization? I think that will change the guidelines for how you can act efficiently. If you're interested in learning more, please feel free to contact Nakajima or any member of SRG. We'd love to exchange information.
Thank you for reading this far.