Thoughts on running mob cost analysis sessions and analyzing cost reduction measures: an SRE's approach to cost management

This article is the day 4 entry in the CyberAgent Group SRE Advent Calendar 2024.
 
This is my first contribution to this blog. My name is Nakajima.
Until recently I worked alongside the Service Reliability Group (SRG) of the Media Division, and I now work mainly as an SRE in the Game & Entertainment Division (SGE).
 
Today I'd like to talk informally about the mob cost analysis sessions that were a hot topic this year, and about how we approach cost reduction as SREs.
 

A fun mob cost analysis session


The concept of mob cost analysis was proposed by DELTA Co., Ltd. in a presentation at the AWS Cost Reduction World Tournament (I don't recall ever hearing about it before).
Simply put, it's a meeting where everyone looks at the system's dashboard and discusses costs from various angles.
Cost-related tasks are often handled individually as issues arise, so I was very interested in the approach of making it a team-wide effort, and I immediately tried it out in several projects.
Trying it out showed that it was a very meaningful initiative: working on cost reduction together gave the participating members a deeper understanding of the system architecture. For the specific parts where costs inevitably run high, the discussions dug deeply into the business, backend, and infrastructure perspectives, which I found very satisfying as an engineer.
 

Analyzing, as an SRE, why cost reduction was not getting done


Using mob cost analysis to identify areas where costs have not yet been reduced, and then working to resolve them, is in its own way a very enjoyable process.
Taking it a step further, I used this opportunity to put into words what is actually happening in our cost reduction efforts. Specifically:
  • List the cost reduction tasks that had already been completed before the mob cost analysis sessions.
  • Combine the completed tasks with those not yet completed, and classify them as follows:
    • Classify the types of tasks
      • Architecture changes
      • Over-provisioning
      • Deleting neglected resources
      • Others
    • Analyze the blocking elements for tasks that were not completed
      • Went unnoticed
      • Poor cost-effectiveness relative to man-hours / low task priority
      • Coordination is troublesome
      • etc.
I asked various departments to compile this information and visualized which tasks were being carried out, which were not, and why.
I'll omit the details because it involves information from multiple projects, but it's something like this:
 
Figure 1) Sample analysis of tasks that were already being done
 
Figure 2) Sample of analysis of tasks that were not completed.
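
As a rough illustration of the kind of tally behind these figures, here is a minimal Python sketch that counts tasks by type and, for unfinished tasks, by blocking element. The record fields, category labels, and sample tasks are all hypothetical stand-ins and are not data from the actual projects.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostTask:
    name: str
    task_type: str                 # e.g. "over-provisioning"
    done: bool
    blocker: Optional[str] = None  # only meaningful when done is False


def summarize(tasks):
    """Count completed tasks by type, and unfinished tasks by type and by blocker."""
    done_by_type = Counter(t.task_type for t in tasks if t.done)
    todo_by_type = Counter(t.task_type for t in tasks if not t.done)
    todo_by_blocker = Counter(t.blocker or "unknown" for t in tasks if not t.done)
    return done_by_type, todo_by_type, todo_by_blocker


# Hypothetical sample records, for illustration only.
SAMPLE_TASKS = [
    CostTask("shrink staging DB", "over-provisioning", True),
    CostTask("shrink batch workers", "over-provisioning", False,
             "coordination is troublesome"),
    CostTask("delete old snapshots", "neglected resources", False,
             "went unnoticed"),
    CostTask("move to managed cache", "architecture change", False,
             "low priority"),
    CostTask("delete unused load balancer", "neglected resources", True),
]

if __name__ == "__main__":
    done, todo, blockers = summarize(SAMPLE_TASKS)
    print("done by type:", dict(done))
    print("not done by type:", dict(todo))
    print("not done by blocker:", dict(blockers))
```

Even a simple count like this makes it easier to see which task types dominate and which blocking elements come up repeatedly.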
 
This initiative is not being implemented across the entire group, including SRG, and at this stage, it is merely an analysis of several projects I have been personally involved in.
For both the tasks that had been done and those that had not, it became clear that the main categories were:
  • Resolving over-provisioning
  • Cleaning up neglected resources
 

Blocking elements in cost reduction


The blocking elements in cost reduction are, in essence, the reasons why the work is not done right away. Apart from the problem simply going unnoticed, you can probably imagine that the reasons vary from project to project.
 
  • Coordination is troublesome
    • The coordination overhead is disproportionately high compared to the difficulty and importance of the actual work.
    • This is particularly noticeable in test environments and internal company environments.
    • The resource should have been easy to shut down, but the number of people involved has grown, so it can no longer be touched casually.
  • The bar for the work is high
    • Performance buffers, thoroughness of testing, and so on. This is particularly noticeable in production.
  • Low cost-effectiveness relative to man-hours
    • The absolute amount saved is small.
    • The required man-hours are too high compared to the man-hours available at the time, and so on.
  • The task ends up with a low priority
    • It is simply not what the project is prioritizing right now.
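
The "low cost-effectiveness relative to man-hours" judgment can be made explicit with a simple payback calculation. The sketch below is only one assumed way of framing it: the hourly rate, the six-month threshold, and the function names are illustrative choices, not figures or rules from the projects discussed here.

```python
def payback_months(effort_hours, hourly_rate, monthly_saving):
    """Months until the monthly saving repays the one-off engineering cost."""
    if monthly_saving <= 0:
        return float("inf")  # never pays back
    return (effort_hours * hourly_rate) / monthly_saving


def worth_doing(effort_hours, hourly_rate, monthly_saving, max_payback_months=6.0):
    """True if the task pays for itself within the assumed threshold."""
    return payback_months(effort_hours, hourly_rate, monthly_saving) <= max_payback_months


if __name__ == "__main__":
    # Hypothetical numbers: 16 hours of work at 50 per hour, saving 400 per month.
    print(payback_months(16, 50, 400), worth_doing(16, 50, 400))
    # Hypothetical numbers: 40 hours of work at 50 per hour, saving 100 per month.
    print(payback_months(40, 50, 100), worth_doing(40, 50, 100))
```

Note that `effort_hours` should include coordination time, not only hands-on work; otherwise the "coordination is troublesome" cost described above stays hidden.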
 
In any case, it's easy to imagine that it would take considerable effort for the person in charge to overcome these blocking elements on their own.
Therefore, it could be said that cost reduction projects tend to get underway precisely when a top-down order to cut costs is issued.
 

Simple principles of cost management that SREs implement


Now, based on the above analysis, how should we approach cost management for projects during normal times?
If over-provisioning and neglected resources are the main culprits of wasted costs, and if changing something that's already running requires a certain amount of effort, then it seems quite simple, doesn't it?
 
  • Avoid over-provisioning altogether.
If most of the improvement work consists of fixing over-provisioning, then the conclusion is that we should focus strongly on preventing over-provisioning from occurring in the first place.
If there is no rework to do, there is no coordination either; "I'll do it later" never turns into "I won't do it later (or can't do it)"; and nothing is left as it is because the savings do not justify the man-hours.
 
Cost reduction shouldn't be tackled all at once; rather, by investing a little extra effort in appropriate tasks from the start, you can prevent the need for cost reduction work from arising later.
This is much like technical debt, except that careless use of the cloud today does not create an abstract debt; it turns directly into cash outflow.
 
Although the analysis points to over-provisioning as the problem, the same principle also covers avoiding configurations that would later require architectural changes, by designing the system properly with expert knowledge from the outset.
 
In addition, the following will be needed to support this.
  • Regular review efforts
    • A basic measure against neglected resources
    • Respond properly to changes in preconditions.
      • Don't leave the initially chosen specs as they are.
      • Make sure to check the state of the system after it has been handed over.
      • Don't treat a past cost-based decision as permanent.
    • To prevent oversights and things going unnoticed, it is best to have multiple people check.
  • Fostering a culture that is proactive about cost optimization (team + project)
    • Being able to act on costs without being ordered to.
    • Moving away from dependence on individual experts (transparency, democratization)
      • Broaden understanding of cost analysis tools (such as AWS Cost Explorer).
      • A culture where project teams constantly monitor resource utilization.
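
A regular review of this kind can start from something very simple. The following Python sketch flags resources that look neglected or over-provisioned from a small inventory. The inventory fields, the 30% utilization threshold, and the 30-day idle threshold are assumptions made for illustration; real values would have to come from your own monitoring and cost tools.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Resource:
    name: str
    owner: str
    peak_utilization: float  # 0.0 to 1.0, peak over the review window
    last_used: date


def review(resources, today, low_util=0.3, idle_days=30):
    """Return (name, finding, owner) tuples for resources worth discussing."""
    findings = []
    for r in resources:
        if (today - r.last_used).days >= idle_days:
            findings.append((r.name, "possibly neglected", r.owner))
        elif r.peak_utilization < low_util:
            findings.append((r.name, "possibly over-provisioned", r.owner))
    return findings


# Hypothetical inventory, for illustration only.
SAMPLE_INVENTORY = [
    Resource("api-prod", "team-a", 0.72, date(2024, 12, 3)),
    Resource("batch-stg", "team-b", 0.12, date(2024, 12, 1)),
    Resource("old-test-db", "team-c", 0.05, date(2024, 9, 1)),
]

if __name__ == "__main__":
    for name, finding, owner in review(SAMPLE_INVENTORY, date(2024, 12, 4)):
        print(f"{name}: {finding} (owner: {owner})")
```

The output is meant as a list of discussion items for several people to check together, not as an automatic deletion list.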
 
For putting these regular reviews into practice and fostering this culture, I believe that holding initiatives like mob cost analysis sessions on a regular basis is a simple measure that can yield considerable results.
 

"Even though we're being asked to cut costs, there's nothing we can immediately cut because we're already doing it regularly."


As a result of the above analysis, in the projects I am in charge of I now aim for a state where "even if we are told to cut costs, there is nothing to cut immediately because we are already doing it regularly."
The work SREs should take on ranges from system design to SLIs/SLOs, and one goal we can set is that, because SREs are present, the system is continuously operated in a lean state with no waste.
 
I want to foster a culture where project and team performance are evaluated based on cost management from the outset, so that we can proudly say that SREs are making a significant contribution.
 

In conclusion


When it comes to reducing infrastructure costs, the places that had been wasting the most money are the ones that appear to achieve big results when they cut, so it is understandable that many presentations are of that kind.
Still, I think a presentation summarizing what happened when a system was operated on the principle of "even though we're told to cut costs, there's nothing we can immediately cut because we're already doing it regularly" would also be interesting.
 
What results would you get if you analyzed cost reduction measures and their blocking factors within your own organization? The results may change your guidelines for acting more efficiently. If you're interested in learning more, feel free to quietly reach out to Nakajima or any of the SRG members. Let's exchange information!
Thank you for reading this far.