Proof of Concept (PoC) for Incident Evacuation Drills Utilizing Generative AI with Slack + AWS Chatbot + Bedrock

I'm Yuta Kikai (@fat47) of the Service Reliability Group (SRG) in the Media Management Division.
#SRG (Service Reliability Group) primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
 
This article introduces a proof-of-concept (PoC) for incident evacuation drills using Slack workflows, AWS Chatbot, and Bedrock, which we created during our team's hackathon event.
 

SRG's one-day hackathon event, "SRG TechFest"


Our team holds a hackathon once every quarter, and all team members participate.
Each team picks a challenge related to the event's theme and works on it for a full day.
The overall theme this time was "Utilizing Generative AI from an SRE Perspective."
 
My team tried the following two things:
  • Incident evacuation drills utilizing generative AI
  • Utilizing a local LLM for Slack search
This article focuses on the former, "incident evacuation drills utilizing generative AI."
 

Incident evacuation drills utilizing generative AI


Why we chose incident evacuation drills as our theme

An incident evacuation drill simulates a real incident in order to verify escalation flows and response flows.
Running one has several benefits, including improving the team's ability to respond to real incidents and helping junior members grow.
However, preparing a drill can feel difficult, and it is often unclear where to begin.
 
We therefore thought that by having generative AI simulate an incident and then working out how to respond to it, we could gain insights that apply to real incident response.

Architecture

A Slack workflow sends a query to AWS Chatbot, which passes it to Amazon Bedrock and returns the response to Slack.
 
With the September 2024 update, it became possible to communicate with Amazon Bedrock from Slack using AWS Chatbot without writing any code.
 
This PoC was likewise built without writing any code.

What we built

Click the button in the Slack channel to launch the workflow.
 
Once the workflow starts, select the service name and its system configuration, then submit the form.
 
The system configuration information is posted to the channel as a message and sent to AWS Chatbot.
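
The PoC itself needed no code, since the Slack workflow composes this message from the form fields. Purely as an illustration of what such a message might contain, here is a minimal Python sketch. The function name `build_drill_message`, the `@aws` default mention, and the exact wording of the request are all assumptions made for this example, not the text our workflow actually posts.

```python
def build_drill_message(service_name: str, system_config: str, mention: str = "@aws") -> str:
    """Compose the text a workflow step could post to the channel (hypothetical format)."""
    # Collapse line breaks and repeated spaces so multi-line form input
    # ends up as a single chat message line.
    config = " ".join(system_config.split())
    return (
        f"{mention} Start an incident drill. "
        f"Service: {service_name.strip()}. "
        f"System configuration: {config}. "
        "Describe one realistic incident and ask how we would respond."
    )


if __name__ == "__main__":
    print(build_drill_message("blog", "ALB -> ECS -> Aurora"))
```

Keeping the service name and configuration as separate, clearly labeled parts of the message gives the model a concrete basis for generating a plausible incident.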
 
A few seconds later, the chatbot replies in the thread with the problem statement.
 
Click the "Submit Response" button, enter your proposed course of action as text, and submit it.
 
Depending on the situation, the chatbot may ask follow-up questions in return.
 
If it does, press the submit button again and enter your answer.
 
Finally, you receive a score, an evaluation, and an explanation of your actions up to that point.
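
The exchange above (problem statement, response, optional follow-up questions, final score) can be pictured as a small session record. The following Python sketch is only a mental model of that flow, not something the PoC contains. The `DrillSession` class, its method names, and the 0 to 100 score range are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DrillSession:
    """One drill thread: the AI's problem statement plus the turns that follow."""
    problem: str
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    score: Optional[int] = None  # set once the AI gives its final evaluation

    def _ensure_open(self) -> None:
        if self.score is not None:
            raise ValueError("drill already finished")

    def answer(self, text: str) -> None:
        """Participant submits a proposed course of action."""
        self._ensure_open()
        self.turns.append(("participant", text))

    def follow_up(self, question: str) -> None:
        """AI asks an additional question before scoring."""
        self._ensure_open()
        self.turns.append(("ai", question))

    def finish(self, score: int, explanation: str) -> None:
        """AI closes the drill with a score (assumed 0-100) and an explanation."""
        self._ensure_open()
        if not 0 <= score <= 100:
            raise ValueError("score must be between 0 and 100")
        self.score = score
        self.turns.append(("ai", explanation))

    def transcript(self) -> str:
        """Render the whole thread as plain text, one turn per line."""
        lines = [f"ai: {self.problem}"]
        lines += [f"{who}: {text}" for who, text in self.turns]
        return "\n".join(lines)
```

A record like this would make it easy to save drill transcripts for later review, for example when comparing how different participants approached the same incident.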

Impressions

Good points
  • The accuracy is rough, but it could be a good starting point for thinking about the initial response to an incident.
  • The advice on "what you should have done" could help raise the overall skill level of junior members.
 
Areas for improvement
  • Verifying whether the escalation flow actually works still requires drills involving real people.
  • The scoring is lenient: even a vague description of the investigation can earn a high score.
  • It would be good to prepare next steps that follow on from this drill workflow.
    • For example, if no escalation flow has been established yet, point participants to the information needed to establish one.
 

In conclusion


It's still quite rough around the edges, but I feel we have started to see a direction that could lead to real results.
The event was a great opportunity, and I enjoyed tackling things I wouldn't normally have time for in my regular work.
 
Going forward, we'd like to have people outside the team try it out and use their feedback to make improvements.
 
SRG is looking for new team members. If you are interested, please contact us here.