Datadog + Dify + Slack: Capacity Planning Made Easy

My name is Kataoka, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
*SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
This article was written at SRG TechFest, a hackathon event held within SRG. It introduces a PoC we created to make capacity planning easier.

What is SRG TechFest?


Before getting into the main topic, let me briefly explain what SRG TechFest is. It is a hackathon event that SRG holds every quarter: a theme is decided each time, and individuals and teams spend the entire day working on it.
This time, we worked in teams of 3-4 people under the theme of "Leveraging generative AI from an SRE perspective."
*SRG also holds various other activities, such as quarterly workshops (study sessions) and weekly volunteer reading groups.

Datadog + Dify + Slack


This time, our team wanted to make capacity planning easier, so we decided to leave it to AI.
SRG has a Dify environment that members can use freely, so we explored ways to integrate it with Datadog, which many of our services use. We also set the system up to send the resulting capacity-planning report to Slack.

What we built

The configuration is as follows:
Steps
  1. Specify the Datadog dashboard ID and time range
  2. Fetch the full dashboard definition with the Datadog API
  3. LLM: generate the related metrics queries from the API result
  4. Extract the queries as an array
  5. Iterate over the queries (calling the Datadog QueryMetrics API for each)
  6. Convert the data (array to string)
  7. LLM: create the capacity-planning report (in Slack-friendly notation)
  8. Analyze the iteration results
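The data-handling steps above (2, 4, and 6) can be sketched in Python. This is a minimal sketch, not the actual Dify workflow: the helper names are my own, and it assumes the Datadog v1 REST endpoints (`GET /api/v1/dashboard/{id}` for step 2, `GET /api/v1/query` for step 5) with key-based authentication.

```python
import json
import urllib.parse
import urllib.request

DD_SITE = "https://api.datadoghq.com"


def dd_get(path: str, params: dict, api_key: str, app_key: str) -> dict:
    """Step 2 / step 5: call a Datadog v1 REST endpoint with key auth."""
    url = f"{DD_SITE}{path}?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(url, headers={
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extract_queries(dashboard: dict) -> list[str]:
    """Step 4: pull every metrics query string out of a dashboard
    definition. Widgets can be nested inside group widgets, so we walk
    the tree recursively."""
    queries: list[str] = []

    def walk(widgets: list[dict]) -> None:
        for widget in widgets:
            defn = widget.get("definition", {})
            reqs = defn.get("requests")
            if isinstance(reqs, list):
                for r in reqs:
                    if r.get("q"):
                        queries.append(r["q"])
            walk(defn.get("widgets", []))

    walk(dashboard.get("widgets", []))
    return queries


def series_to_text(series: list[dict]) -> str:
    """Step 6: flatten QueryMetrics series into a plain-text block
    that can be pasted into the LLM prompt."""
    lines = []
    for s in series:
        points = ", ".join(
            f"{v:.2f}" for _, v in s.get("pointlist", []) if v is not None
        )
        lines.append(f"{s.get('metric')} ({s.get('scope')}): {points}")
    return "\n".join(lines)
```

In the real workflow, steps 3 and 7 are LLM nodes in Dify sitting between these helpers; only the plumbing around them is shown here.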

Difficulties

  • The "please output like this" formatting example was sometimes treated as actual data for answering the question (a bad example hurts surprisingly much)
  • The results changed frequently between prompt runs
  • For steps 2 and 7, we had to inspect what data was actually available and revise the prompt until the desired processing was performed

Improvement points

  • Variable substitution in the query does not replace information contained in the data; it is sent as-is, so detailed instructions need to be added
  • Giving the system prompt more information about the data in the context improves accuracy
  • The API response passed to the LLM did not include the query/metric information, which caused low accuracy, so we incorporated the query information into each API response
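For the last point, one way to carry the query information along is to stamp it onto the response before handing it to the LLM. This is a hypothetical helper of my own, not the original workflow's code, assuming the v1 QueryMetrics response shape (a `series` list with `metric` and `pointlist` fields):

```python
def annotate_with_query(query: str, response: dict) -> dict:
    """Copy the originating metrics query onto the response and each of
    its series, so the LLM can tell which query the numbers belong to.
    (Hypothetical helper; field names follow the v1 QueryMetrics schema.)"""
    response["query"] = query
    for series in response.get("series", []):
        series["source_query"] = query
    return response
```

With this, the text passed to the LLM in step 6 always pairs each set of data points with the query that produced them.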

Results

The AI produced a summary of the dashboard for the given period and pointed out areas of concern.
However, its suggestions for improvement were still only generic advice. That said, the very first outputs were completely irrelevant, so this was actually a big improvement. (Looking back, most of the time was a battle with the prompts.)

Conclusion


In the end, the output was not good enough to use for real capacity planning, but even in the short time available I could feel the accuracy gradually improving with each iteration, so I'd like to try again when I have time.
SRG is looking for people to work with us. If you're interested, please contact us here.