Incident Management from Scratch - Introduction

This article is the 12th entry in the CyberAgent Group SRE Advent Calendar 2024.
 
I am tako_sonomono of the Service Reliability Group (SRG) in the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
 
This article presents the purpose of introducing incident management and the three elements to establish first.
Tools for smooth incident management, as well as specific workflows and problem-solving methods, will be covered separately.
 
 

Introduction


In the course of operating our systems on a daily basis, we come across many alerts.
When the system and organization were small, problems could be handled as they came up, but as both scale, challenges such as the following arise:
  1. Alerts fire constantly, but no one is assigned to handle them.
  1. Specific developers handle the alerts, so the process depends more and more on those individuals.
  1. Past incidents recur at a high rate.
  1. There is no established response flow for incidents that have a major impact on users, so recovery takes a long time.
  1. There are no set criteria for deciding whether a user announcement is necessary, so a supervisor must make the decision each time.
  1. A monitoring setting was misconfigured, and the problem is noticed only through a user inquiry.
These issues ultimately delay incident recovery, which in turn undermines user confidence in the service.
 
We use incident management to resolve these issues.
From here on, I will explain what should be done to solve these problems, using the introduction of incident management at Ameba as an example.
 

The purpose of incident management


SRG provides incident management for each of our media service businesses, and is also promoting activities to resolve the above issues at Ameba. Here, the main objectives of incident management are the following three points:
 
  1. Reduce MTTR (Mean Time to Repair)
  1. Reduce the number of incidents
  1. Prevent declines in user satisfaction

1. Reduction of MTTR (Mean Time to Repair)

MTTR refers to the average time it takes for a system to recover from a failure or problem.
The aim is to shorten the time from when a failure occurs to when service is restored by continuously improving the incident response operation cycle.
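
As a rough illustration of the metric, MTTR can be computed as the mean of the recovery durations. The sketch below is only an example: the `Incident` record, its field names, and the sample timestamps are hypothetical and do not describe Ameba's actual data or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; the field names are assumptions for this sketch.
    started_at: datetime   # when the failure began
    resolved_at: datetime  # when service was restored


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Return the average of (resolved_at - started_at) over all incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: one 30-minute and one 90-minute incident.
incidents = [
    Incident(datetime(2024, 12, 1, 10, 0), datetime(2024, 12, 1, 10, 30)),
    Incident(datetime(2024, 12, 5, 22, 0), datetime(2024, 12, 5, 23, 30)),
]
print(mean_time_to_repair(incidents))  # 1:00:00
```

Comparing this value between periods gives one way to check whether the response cycle is actually getting faster.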
 

2. Reducing the number of incidents

The aim is to reduce the occurrence of incidents themselves by conducting postmortems and eliminating in advance the seeds that lead to incidents.
 

3. Preventing declines in user satisfaction

If a service is unavailable for several hours without any announcement from the service side, users will stop trusting it. The aim here is to prevent a decline in trust in the service by optimizing communication to users when an incident occurs.
 
The specific steps to achieve these goals are explained below.

Introducing incident management


  1. Establish an Incident Owner
  1. Establish an Incident Commander
  1. Triage decisions
 

1. Establish an Incident Owner

The Incident Owner is the role responsible for introducing an incident management culture into the business and continuously improving it.
Because it is very difficult to quantitatively measure the effect of incident management after it is introduced, it is not easy to introduce in every organization. In particular, small organizations where the issues above have not yet grown serious will find it hard to invest resources, so the Incident Owner needs to consider the state of the business before introducing it.
 
Even after the decision to introduce incident management has been made, continuous improvement remains essential, as the focus on metrics such as MTTR suggests. Someone therefore needs to set medium- to long-term goals and lead the team, for example by collecting data such as MTTA (mean time to acknowledge) where needed and by working out a strategy for cultivating the culture together with team members.
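
One way an Incident Owner might collect such data is to track MTTA month by month and watch the trend. The following is a minimal sketch under assumptions: the input is a list of (triggered, acknowledged) timestamp pairs, and the function name `monthly_mtta_minutes` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_mtta_minutes(alerts):
    """Group alerts by the month they fired and average the minutes to acknowledge.

    `alerts` is an iterable of (triggered_at, acknowledged_at) datetime pairs.
    """
    by_month = defaultdict(list)
    for triggered_at, acknowledged_at in alerts:
        minutes = (acknowledged_at - triggered_at).total_seconds() / 60
        by_month[triggered_at.strftime("%Y-%m")].append(minutes)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Made-up sample data for illustration only.
alerts = [
    (datetime(2024, 11, 3, 9, 0), datetime(2024, 11, 3, 9, 10)),
    (datetime(2024, 11, 20, 2, 0), datetime(2024, 11, 20, 2, 20)),
    (datetime(2024, 12, 2, 14, 0), datetime(2024, 12, 2, 14, 4)),
    (datetime(2024, 12, 9, 18, 0), datetime(2024, 12, 9, 18, 6)),
]
print(monthly_mtta_minutes(alerts))
```

A falling monthly value is one signal that the goals set by the Incident Owner are having an effect; it does not by itself prove that the culture has taken root.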

2. Establish an Incident Commander

The Incident Commander (IC) is the role that is responsible for resolving an incident.
With few exceptions, the IC does not carry out recovery work personally, but focuses on communicating with relevant parties and giving instructions to members.
The larger the incident, the more important the IC role becomes, so it is a good idea to set up a rotation that lets members gain IC experience on minor incidents.
 
  • Taking responsibility for the incident until it is resolved
  • Making system recovery decisions
  • Communicating with relevant parties
  • Setting up the recovery response team and communication channels (Slack channel, etc.)
  • Giving instructions to the work team
  • Creating the postmortem
 
The difference between the Incident Commander and the Incident Owner is that the IC is appointed only when an incident occurs, and is therefore a point-in-time role.
The IC is not responsible for continuous improvement, such as checking whether the incident management cycle is functioning properly.
 
 

3. Triage decisions

Triage (severity) is decided in order to set the response priority for an incident that has occurred.
Defining triage levels makes it easier to decide response priorities when multiple incidents occur, to understand the importance of an incident from a third-party perspective (such as a business manager's), and to develop an incident response flow and structure.
 
At Ameba, triage is defined in five levels, from SEV1 to SEV5, and the response and team structure differ by level.
For example, for incidents of SEV3 or higher, it is mandatory to report the problem to users via the staff blog, and the IC handles the response, including user communication. (*Ameba does not have what SRE practice calls a Communication Commander. That role is unnecessary unless the service requires complex user communication.)
 
This allows engineers to focus solely on recovery efforts when a major incident occurs.
On the other hand, for incidents of SEV4 or below, the workflow is not complicated, so an engineer handles the recovery response while also acting as the IC.
The point to note is that the SEV judgment can be left to the IC.
If the SEV decision is delayed, recovery from the incident itself is also delayed. It is important to create an atmosphere where people can easily ask the IC to make the SEV decision when needed.
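
The SEV rules described above can be written down as a small lookup so that the response for each level is unambiguous. This is a hedged sketch, not Ameba's actual implementation: it assumes that a lower number means a more severe incident (SEV1 is the worst), and `Severity`, `ResponsePlan`, `plan_for`, and `response_order` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumption: a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class ResponsePlan:
    user_announcement_mandatory: bool  # report to users via the staff blog
    dedicated_ic: bool                 # IC is separate from the recovery engineers


def plan_for(sev: Severity) -> ResponsePlan:
    """SEV3 or higher: mandatory announcement and a dedicated IC.

    SEV4 or below: the engineer doing the recovery also acts as the IC.
    """
    major = sev <= Severity.SEV3
    return ResponsePlan(user_announcement_mandatory=major, dedicated_ic=major)


def response_order(open_incidents: list[Severity]) -> list[Severity]:
    """Order simultaneous incidents so the most severe is handled first."""
    return sorted(open_incidents)


print(plan_for(Severity.SEV2))
print(plan_for(Severity.SEV4))
print(response_order([Severity.SEV4, Severity.SEV1, Severity.SEV3]))
```

Keeping the rule in one place like this means the IC only has to decide the SEV level; the announcement and staffing consequences follow from it.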
 
 

Conclusion


In this article, we presented three elements that are essential when introducing incident management.
In the next article, we will introduce specific workflows and problem-solving methods.
 
For reference, I've attached the slides from a talk I gave at a previous SRE-related event!
 
SRG is looking for people to work with us. If you are interested, please contact us here.