Incident Management from Scratch - Introduction

This article is the 12th entry in the CyberAgent Group SRE Advent Calendar 2024.
 
I am Tanaka (@tako_sonomono) of the Service Reliability Group (SRG) in the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-cutting support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
 
This article will explain the purpose of introducing incident management and the three elements that should be established first.
Tools for smooth incident management, specific workflows, and problem-solving methods will be covered separately.
 
 

Introduction


In the course of operating our systems on a daily basis, we encounter many alerts.
Problems that were manageable while the system and organization were small turn into real challenges as both scale, for example:
  1. Alerts fire constantly, and no one is assigned to handle them.
  1. Alerts are handled by specific developers, and the response is becoming increasingly dependent on those individuals.
  1. Past incidents recur at a high rate.
  1. There is no established response flow for incidents that have a major impact on users, so recovery takes a long time.
  1. There are no set criteria for deciding whether a user announcement is necessary, so a supervisor must make the decision each time.
  1. Monitoring is misconfigured, and problems are discovered only through user inquiries.
These issues ultimately delay recovery from failures, which in turn undermines user confidence in the service.
 
Incident management is used to resolve these issues.
In what follows, I will explain what needs to be done to solve these problems, including examples from Ameba.
 

The purpose of incident management


SRG provides incident management for each of our media service businesses, and is also promoting activities to resolve the above issues at Ameba. Here, the main objectives of incident management are the following three points:
 
  1. Reducing MTTR (Mean Time to Repair)
  1. Reducing the number of incidents
  1. Preventing a decline in user satisfaction

1. Reducing MTTR (Mean Time to Repair)

MTTR refers to the average time it takes for a system to recover from a failure or problem.
The aim is to shorten the time from when a failure occurs to when service is restored by continuously improving the incident response operation cycle.
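
As a minimal sketch of the definition above, MTTR can be computed as the average of the recovery durations across past incidents. The incident records and the function name `mean_time_to_repair` below are hypothetical and only illustrate the arithmetic; they do not reflect how Ameba actually records incidents.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (time the failure began, time service was restored).
incidents = [
    (datetime(2024, 12, 1, 10, 0), datetime(2024, 12, 1, 10, 45)),   # 45 minutes
    (datetime(2024, 12, 5, 22, 30), datetime(2024, 12, 6, 0, 0)),    # 90 minutes
    (datetime(2024, 12, 9, 3, 15), datetime(2024, 12, 9, 3, 30)),    # 15 minutes
]


def mean_time_to_repair(records):
    """Return the average of (restored - started) across all records."""
    if not records:
        raise ValueError("at least one incident is required")
    total = sum((restored - started for started, restored in records), timedelta())
    return total / len(records)


print(mean_time_to_repair(incidents))  # 0:50:00
```

Tracking this value per month or per quarter is one way to check whether improvements to the response cycle are actually shortening recovery.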
 

2. Reducing the number of incidents

The purpose is to reduce the occurrence of incidents by conducting postmortems and eliminating, in advance, the potential causes that lead to incidents.
 

3. Preventing a decline in user satisfaction

If a service is unavailable for several hours without any announcement from the service side, users will lose trust in that service. The purpose here is to prevent that loss of trust by optimizing communication with users when an incident occurs.
 
Below we explain what you need to do specifically to achieve these goals.

What to establish first


  1. Establish an Incident Owner
  1. Establish an Incident Commander
  1. Triage decisions
 

1. Establish an Incident Owner

The incident owner is the role responsible for introducing an incident management culture into the business and continuously improving it.
Because it is extremely difficult to measure the effectiveness of incident management quantitatively after it is introduced, it is not easy to introduce in every organization. In particular, small organizations whose problems have not yet grown will find it hard to invest resources in it, so the incident owner must consider the state of the business before introducing it.
 
Even after the decision to introduce it is made, continuous improvement remains important in incident management, as the goal of reducing MTTR (Mean Time to Repair) suggests. Someone needs to lead the team over the medium to long term: collecting data, measuring MTTA (Mean Time to Acknowledge) where necessary, setting goals, and working out with team members a strategy for cultivating the culture.

2. Establish an Incident Commander

The Incident Commander (IC) is the role responsible for resolving the incident.
With some exceptions, the IC does not carry out recovery work personally, but instead focuses on communicating with relevant parties and issuing instructions to members.
The role of the IC becomes more important as incidents grow in scale, so it is a good idea to rotate the role on a regular schedule so that members gain IC experience on minor incidents.
 
The IC's main responsibilities are:
  • Owning the incident until it is resolved
  • Making system recovery decisions
  • Communicating with all parties involved
  • Setting up the recovery response team and its communication channels (Slack channel, etc.)
  • Giving instructions to the members doing the work
  • Writing the postmortem
 
The difference from the Incident Owner is that the IC is appointed only when an incident occurs and is merely a point-in-time role.
The IC is not responsible for continuous improvement, such as checking whether the incident management cycle is functioning properly.
 
 

3. Triage decisions

Triage assigns a severity level to each incident in order to decide its response priority.
By setting up a triage system, it becomes easier to determine response priorities when multiple incidents occur, understand the importance of incidents from a third-party perspective (such as a business manager), and develop an incident response flow/structure.
 
At Ameba, severity is classified into five levels, SEV1 to SEV5, and the response and the team structure differ by level.
For example, for incidents of SEV3 or higher, reporting the problem to users via the staff blog is mandatory, and the IC handles the response, including user communication. (Note: Ameba does not have what SRE practice would call a communications commander. This role is probably unnecessary unless the service requires complex user communication.)
 
This allows engineers to focus solely on recovery work when a major incident occurs.
On the other hand, for incidents of SEV4 or below, the workflow is not complicated, so the responding engineer also serves as the IC and handles recovery.
One point to note is that the SEV decision itself can be delegated to the IC.
If the SEV decision is delayed, recovery from the incident itself will also be delayed. It is important to create an environment where people can easily ask the IC for a SEV decision when necessary.
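
The SEV-based branching described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `Severity`, `ResponsePlan`, and `plan_for` are hypothetical, SEV1 is assumed to be the most severe level, and the article does not say whether announcements are ever made for SEV4 or below, so the sketch only records that they are not mandatory.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumption: a smaller number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class ResponsePlan:
    user_announcement_mandatory: bool  # report to users via the staff blog
    dedicated_ic: bool                 # IC is separate from the recovering engineers


def plan_for(sev: Severity) -> ResponsePlan:
    """Return the response structure for a given severity level."""
    # "SEV3 or higher" is read here as SEV1 to SEV3.
    major = sev <= Severity.SEV3
    return ResponsePlan(user_announcement_mandatory=major, dedicated_ic=major)


for sev in Severity:
    print(sev.name, plan_for(sev))
```

Writing the rules down in one place like this, whether in code or in a runbook, means the announcement decision no longer depends on a supervisor's judgment each time.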
 
 

Conclusion


In this article, we presented the three elements that are essential for introducing incident management.
In the next article, we will introduce specific workflows and problem-solving methods.
 
For reference, I've attached the slides from a talk I gave at a previous SRE-related event!
 
SRG is looking for people to work with us. If you're interested, please contact us here.