[New Feature] Automatic troubleshooting verification using AWS DevOps Agent and points to note

2025/12/6 16:57

2025/12/11 8:05

✅

This article was 100% written by a human.

This is Onkai Yuta (@fat47) from the Service Reliability Group (SRG) of the Media Headquarters.

#SRG (Service Reliability Group) is a group that mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.

This article is the Day 11 entry in the CyberAgent Group SRE Advent Calendar 2025.

We have summarized the results of our testing of the DevOps Agent announced at re:Invent 2025.

What is AWS DevOps Agent?
Steps to use DevOps Agent
  0. Preparation
  1. Creating a DevOps Agent Space
  2. Instruct DevOps Agent to investigate
Current status of DevOps Agent
  1. The scope of the DevOps Agent is limited to investigating the root cause of failures, proposing recovery measures, and proposing preventive measures; executing recovery and preventive measures is not included in the scope.
  2. Usage is free during the preview period, but there are limitations.
  3. In the initial state, the DevOps Agent will not investigate the problem unless you issue a command to investigate it from the UI screen.
  4. Differences in response status on the monitoring service side
  5. You can’t investigate inside the EC2 server
Conclusion

What is AWS DevOps Agent?

DevOps Agent is a new feature announced at re:Invent 2025. It is an AI agent that can identify the root cause of failures from metrics and logs and propose mitigation and prevention measures.

It is currently available as a public preview in the Northern Virginia (us-east-1) region only.

In addition to CloudWatch, the following monitoring services can be integrated as alert sources and telemetry sources.

Datadog

Dynatrace

New Relic

Splunk

For more details, please refer to the official AWS blog.

AWS DevOps Agent helps you accelerate incident response and improve system reliability (preview) | Amazon Web Services

New service acts as an always-on DevOps engineer, helping you respond to incidents, identify root causes, and prevent future issues through systematic analysis of incidents and operational patterns.

https://aws.amazon.com/blogs/aws/aws-devops-agent-helps-you-accelerate-incident-response-and-improve-system-reliability-preview/

Steps to use DevOps Agent

0. Preparation

We created a simple web system to make it easier to visualize how a problem is investigated.

It is a web system that simply displays a list of car inventory.

The components are as follows:

CloudFront

API Gateway

Lambda

Aurora MySQL Serverless v2

Every time a user accesses the web system, Lambda executes a SELECT query on Aurora MySQL.

Although not listed among the components, we also set up CloudWatch alarms to detect errors in the Lambda and API Gateway access logs.
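As a rough illustration of this kind of alarm, the sketch below builds the parameters for a CloudWatch alarm on a Lambda function's built-in Errors metric. It is a simplification under assumptions, not the configuration used in our test: the alarm name, threshold, and period are made up, and an alarm based on access logs would additionally need a metric filter. boto3 is imported lazily so the parameter builder can be read and tested without AWS credentials.

```python
def build_lambda_error_alarm(function_name: str, threshold: int = 1) -> dict:
    """Return keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,               # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": threshold,     # alarm on the first error (assumed)
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(params: dict, region: str = "us-east-1") -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above has no dependencies

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)
```

Calling create_alarm(build_lambda_error_alarm("getCars")) would create the alarm for the Lambda function used in this article.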

1. Creating a DevOps Agent Space

Set the region to us-east-1, open DevOps Agent, and click Begin Setup.

Enter an appropriate Agent Space name and create it.

The other inputs can be left as default for now.

Once created, the following screen will appear, so click "Operator Access".

This is all you need to do to complete the minimum setup.

Next, open the DevOps Center tab.

Switch Show to Components to see the components running within that AWS account.

2. Instruct DevOps Agent to investigate

First, we will intentionally create a fault.

We deleted the inbound rule in the security group used by Aurora MySQL that allows access from Lambda.

As a result, Lambda could no longer reach Aurora MySQL, and the web system started returning errors.

If you check the CloudWatch alarms you set up in advance, you can see that the Lambda and API Gateway alarms are active.

Now, let's ask the DevOps Agent to investigate the problem.

Enter the investigation instructions in the DevOps Agent's Incident Response tab (English only).

If you don't want to write anything yourself, click the "Latest alarm" button that appears and suitable instructions will be generated for you.

Investigation details:

Once you have entered the above information, click Start investigating.

You can then see the progress of the investigation in real time.

Wait a few minutes for this to complete.

After a few minutes, the agent appears to have found the cause.

📖

User kikai_yuta made changes to the RDS security group (sg-0f647dad8ffda5934) at 01:48:05Z and 01:48:26Z, removing critical security group inbound rules.

The first deletion removed the rule that allowed the security group of the getCars Lambda function (sg-049a4429b1733b7e6) to connect to port 3306 on the Aurora RDS cluster.

Without this rule, all connection attempts from Lambda to the database are blocked at the network layer, resulting in a TCP connection timeout error after a timeout period of approximately 10 seconds.

This network-level blocking explains the following:

Lambda logs show "connect ETIMEDOUT" error for 100% of all invocations

RDS metrics report 0 database connections throughout the incident period

Performance Insights shows 0 database load even though Lambda is attempting to connect

Each connection timeout manifests as a Lambda execution time greater than 10 seconds, triggering a CloudWatch alarm.
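The network-level behaviour described in the report can be checked with a plain TCP connection attempt. The sketch below is an illustration, not part of the agent's output: when a security group silently drops packets, the connection attempt hangs until the timeout expires instead of failing immediately. The 10-second default mirrors the timeout mentioned in the report, and the host passed in is whatever database endpoint you want to probe.

```python
import socket


def can_connect(host: str, port: int = 3306, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (packets dropped, as with a missing security group
        # rule), refused connections, and name resolution failures.
        return False
```

Run from inside the same VPC and subnets as the Lambda function, can_connect("<cluster endpoint>") returning False after roughly the full timeout is consistent with traffic being dropped at the network layer.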

Click the "Go to root cause" button that appears.

The following screen will then appear, so click "Generate mitigation plan."

(A mitigation plan is a set of proposed measures to mitigate the problem.)

Again, wait a few minutes for this to complete.

Once the mitigation plan is complete, suggestions will be made for each step as follows:

Step 1 is preparation, and suggests AWS CLI commands to check the current status.

Next, in Step 2, as a preliminary verification, AWS CLI commands are suggested to check whether the problem is still occurring and whether the Aurora cluster is running.

Step 3 suggests an AWS CLI command to apply the fix. This is the command to change the security group settings.

In Step 4, AWS CLI commands are suggested for post-fix verification, to check whether the system has recovered.

Finally, there is a Rollback suggestion.

This is a command to restore the security group changed by the proposed recovery operation to the state it was in before the fix.
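For readers who prefer an SDK over the AWS CLI, the idea behind the fix and its rollback can be sketched with boto3 as follows. This is an illustration under assumptions, not the command the agent generated: it assumes the missing rule is a TCP 3306 inbound rule that references the Lambda security group, and it reuses the two security group IDs quoted in the root cause report above.

```python
RDS_SG = "sg-0f647dad8ffda5934"     # Aurora security group (from the report)
LAMBDA_SG = "sg-049a4429b1733b7e6"  # getCars Lambda security group (from the report)


def mysql_ingress_from(source_sg: str, port: int = 3306) -> list:
    """Build an IpPermissions list allowing TCP `port` from another security group."""
    return [{
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_sg}],
    }]


def apply_fix(ec2, target_sg: str = RDS_SG, source_sg: str = LAMBDA_SG) -> None:
    """Re-create the deleted inbound rule (the idea behind Step 3)."""
    ec2.authorize_security_group_ingress(
        GroupId=target_sg,
        IpPermissions=mysql_ingress_from(source_sg),
    )


def rollback(ec2, target_sg: str = RDS_SG, source_sg: str = LAMBDA_SG) -> None:
    """Undo the fix by removing the same rule again."""
    ec2.revoke_security_group_ingress(
        GroupId=target_sg,
        IpPermissions=mysql_ingress_from(source_sg),
    )
```

With real credentials, ec2 would be boto3.client("ec2", region_name="us-east-1"). Passing the client in as an argument keeps the functions easy to test with a stand-in object.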

Basically, you just need to follow steps 1 to 4.

Now, try the recovery command suggested in Step 3.

Then run the command suggested in Step 4 and check the results.

When I tried accessing the web system again, the display was successfully restored.

Finally, open the Prevention tab of the DevOps Agent and click Run.

The agent then begins investigating whether there are effective ways to prevent the incident from recurring.

Once completed, the following screen will appear:

When you open Recommendations, you will see a report of preventative measures, with suggestions on what to do to prevent the same problem from occurring.

This concludes a simple DevOps Agent operation check.

Although not covered in this article, there is also a Runbook function.

You can configure it by opening the DevOps Agent Space and clicking the gear icon in the upper right corner.

By writing service-specific information in Markdown in a Runbook, you can make DevOps Agent investigations more effective.

Example of information to include in the Runbook:

Current status of DevOps Agent

1. The scope of the DevOps Agent is limited to investigating the root cause of failures, proposing recovery measures, and proposing preventive measures; executing recovery and preventive measures is not included in the scope.

Executing recovery commands is outside the scope of the DevOps Agent.

If you really want automatic recovery, you will need to implement your own AI agent and integrate it with the recovery plan proposed by the DevOps Agent.

2. Usage is free during the preview period, but there are limitations.

DevOps Agent is free to use during the preview period, but has the following limitations:

⚠️

10 Agent Spaces per account

20 hours of incident response time per month

10 hours of incident prevention per month

1,000 chat messages per month

3. In the initial state, the DevOps Agent will not investigate the problem unless you issue a command to investigate it from the UI screen.

Even if a CloudWatch alarm is triggered, it will not be automatically investigated by default.

There are three main ways to start a problem investigation:

A. Built-in integrations

Integrate with ticketing systems such as ServiceNow to automatically start incident investigations from tickets

B. WebSockets/Webhooks

Events are delivered to the AWS DevOps Agent via WebSockets/Webhooks, so that alarms in services such as Datadog and Dynatrace automatically trigger an investigation

C. Manually

Manually started from the DevOps Agent Space web UI

Incident Response

To automatically launch an investigation, you need to set up integration with Datadog, Dynatrace, or New Relic to receive alerts.

4. Differences in response status on the monitoring service side

Even among the monitoring services listed above, the level of integration support varies by service.

Dynatrace

Setup is possible simply by linking your Dynatrace account

This is because AWS hosts the Dynatrace MCP server for the DevOps Agent.

Connecting Dynatrace - AWS DevOps Agent

https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-connecting-telemetry-sources-dynatrace.html

Datadog

Requires the use of Datadog's Remote MCP server. However, it is currently in private preview by request and is only available to a limited number of customers.

Datadog MCP Server

Connect AI agents to Datadog observability data using the MCP Server to query metrics, logs, traces, and other insights.

https://docs.datadoghq.com/ja/bits_ai/mcp_server/

New Relic

Requires the use of New Relic's Remote MCP server. Since this is a public preview, it is available to general users.

New Relic AI Model Context Protocol (MCP)

Connect AI development tools to New Relic's observability platform through a standardized protocol for seamless data access and intelligent insights.

https://docs.newrelic.com/jp/docs/agentic-ai/mcp/overview/

Splunk

Requires use of Splunk's Remote MCP server.

Splunk Docs

https://help.splunk.com/en/splunk-cloud-platform/mcp-server-for-splunk-platform/about-the-mcp-server-for-splunk-platform

CloudWatch

CloudWatch alone cannot currently trigger a DevOps Agent problem investigation.

To use a CloudWatch alarm as the trigger, you need to forward it to the DevOps Agent webhook yourself, for example via EventBridge and Lambda.
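A minimal sketch of the Lambda side of such a bridge is shown below. It assumes an EventBridge rule forwards "CloudWatch Alarm State Change" events to the function. The webhook URL (read from a hypothetical DEVOPS_AGENT_WEBHOOK_URL environment variable) and the JSON payload shape are placeholders, and any authentication or signing the real webhook requires is omitted, so check the DevOps Agent documentation for the actual request format.

```python
import json
import os
import urllib.request


def build_payload(event: dict) -> dict:
    """Summarize an EventBridge CloudWatch alarm event (payload shape is assumed)."""
    detail = event.get("detail", {})
    state = detail.get("state", {})
    alarm = detail.get("alarmName", "unknown")
    return {
        "title": f"CloudWatch alarm {alarm} is {state.get('value', 'UNKNOWN')}",
        "description": state.get("reason", ""),
        "timestamp": event.get("time", ""),
        "region": event.get("region", ""),
    }


def lambda_handler(event, context):
    """Forward ALARM transitions to the webhook and ignore everything else."""
    state = event.get("detail", {}).get("state", {}).get("value")
    if state != "ALARM":
        return {"forwarded": False}

    request = urllib.request.Request(
        os.environ["DEVOPS_AGENT_WEBHOOK_URL"],  # hypothetical variable name
        data=json.dumps(build_payload(event)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"forwarded": True, "status": response.status}
```

Filtering on the ALARM state keeps OK and INSUFFICIENT_DATA transitions from starting investigations and using up the monthly incident response hours.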

5. You can’t investigate inside the EC2 server

The DevOps Agent can investigate CloudWatch logs and metrics, the status of each component, etc., but it cannot directly check the logs or process status within the EC2 server.

Let's say you have Apache, Tomcat, and MySQL running on the same EC2 server, and MySQL is experiencing a high load.

The DevOps Agent can tell you that the CPU load on your EC2 server is high, but it cannot tell you which process is causing it.

To investigate the cause of load on your EC2 servers, you need to send detailed metrics to an external service such as Datadog and have the DevOps Agent investigate them.

Conclusion

With DevOps Agent, investigating the cause of failures and proposing recovery procedures can be automated. It looks like a very promising product!

This is still a preview release, so please feel free to try it out and send your feedback to AWS!

If you are interested in SRG, please contact us here.

Recruitment Information - CyberAgent SRG #ca_srg

About SRG SRG (Service Reliability Group) is working to improve reliability by promoting the introduction of SRE to the media business as a cross-sectional SRE, based on the vision of "improving reliability across the media business." The work is centered around the following three areas: Consolidating and deploying the technical know-how of each business

https://ca-srg.dev/careers