Datadog + Dify + Slack make capacity planning easy
My name is Kataoka and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
This article comes out of SRG TechFest, a hackathon event held within SRG, and introduces a PoC we built to make capacity planning easier.
What is SRG TechFest?
Before getting into the main topic, let me briefly explain what SRG TechFest is. It is a hackathon event that SRG holds every quarter. Each time a theme is decided, and individuals or teams spend the whole day working on it.
This time, we decided to work in a team of 3-4 people under the theme of "Leveraging generative AI from an SRE perspective."
*SRG also runs various other activities, such as quarterly workshops (study sessions) and weekly voluntary reading groups.
Datadog + Dify + Slack
This time, our team wanted to make capacity planning easier, so we decided to leave it to AI.
SRG has a Dify environment that members can use freely, so we explored ways of linking it with Datadog, which many of our services already use. We also decided to post the capacity planning results to Slack.
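Jumping ahead slightly, the Slack side is the simplest part: the finished report only needs to be pushed through an incoming webhook. Below is a minimal sketch assuming the webhook URL lives in an environment variable named SLACK_WEBHOOK_URL (a hypothetical name); the PoC itself sends the message from the Dify workflow, so treat this as illustrative only.

```python
import os

import requests


def notify_slack(report: str) -> None:
    """Post the capacity planning report to a Slack channel.

    SLACK_WEBHOOK_URL is a hypothetical environment variable holding an
    incoming-webhook URL; the PoC posts from the Dify workflow instead.
    """
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    # Incoming webhooks accept a simple JSON payload; "text" is rendered with
    # Slack mrkdwn, which is why the report is generated in that format.
    response = requests.post(webhook_url, json={"text": report}, timeout=10)
    response.raise_for_status()
```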
What we built
The overall configuration: a Dify workflow pulls data from Datadog, has an LLM analyze it, and posts the resulting report to Slack.

Steps
- Specify a Datadog dashboard ID and time range
- Fetch the full dashboard definition via the Datadog API
- LLM: generate related metrics queries from the API response
- Extract the result as an array of queries
- Iterate over the queries, running the Datadog QueryMetrics API for each (a rough sketch of these API calls follows this list)
- Convert the data (array to string)
- LLM: create the capacity planning report by analyzing the iteration results *making sure the output uses Slack-friendly formatting
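For reference, here is roughly what the dashboard fetch, the per-query QueryMetrics calls, and the array-to-string conversion look like when written directly against the official datadog-api-client Python library. The actual PoC runs these as Dify workflow nodes, so this is only a minimal sketch: the dashboard ID and the hard-coded query stand in for the workflow inputs and for the queries the LLM generates.

```python
import time

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.dashboards_api import DashboardsApi
from datadog_api_client.v1.api.metrics_api import MetricsApi

# Workflow inputs: a dashboard ID and a time range.
# "abc-123-def" is a placeholder; DD_API_KEY / DD_APP_KEY come from the env.
DASHBOARD_ID = "abc-123-def"
TO_TS = int(time.time())
FROM_TS = TO_TS - 7 * 86400  # e.g. the last 7 days

configuration = Configuration()
with ApiClient(configuration) as api_client:
    # Fetch the full dashboard definition (widgets and their queries).
    # This JSON is what the query-generating LLM receives as context.
    dashboard = DashboardsApi(api_client).get_dashboard(dashboard_id=DASHBOARD_ID)

    # In the PoC an LLM reads the dashboard JSON and proposes related metrics
    # queries; this hard-coded list stands in for that output.
    queries = ["avg:system.cpu.user{service:example} by {host}"]

    # Run each generated query against the QueryMetrics API.
    metrics_api = MetricsApi(api_client)
    results = [
        metrics_api.query_metrics(_from=FROM_TS, to=TO_TS, query=q)
        for q in queries
    ]

# Array-to-string conversion so the results can be handed to the
# report-generating LLM as plain-text context.
results_text = "\n".join(str(r) for r in results)
print(results_text)
```

In the workflow, the string produced at the end is what gets passed to the report-generating LLM.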
Difficulties
- The example output we gave ("please answer in a format like this") ended up being treated as data to answer the question from (providing examples worked surprisingly badly)
- The results changed significantly every time the same prompt was run
- For the prompt-based steps (2 and 7), we repeatedly modified the prompts, checking what data was actually available, until they performed the desired action
Improvement points
- Variable substitution in the queries did not replace placeholders with the information contained in the data but sent them as-is, so we added detailed instructions to cover this
- Making the system prompt describe in more detail what data is present in the context improved accuracy
- The API response passed to the LLM did not include the query/metrics information, which caused low accuracy, so we added the query information to the API response before handing it over (see the sketch below)
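As a sketch of the last two points, the context handed to the report LLM can be built so that every series is paired with the query that produced it, and the system prompt can spell out exactly what the JSON contains. The function name and the shape of query_results are assumptions for illustration, not the actual Dify node.

```python
import json


def build_report_context(query_results: list[tuple[str, dict]]) -> tuple[str, str]:
    """Build (system_prompt, user_content) for the report-generating LLM.

    query_results is a hypothetical list of (query, response) pairs: the raw
    QueryMetrics response does not say which query produced it, so the query
    string is attached explicitly to every entry.
    """
    enriched = [
        {"query": query, "series": response.get("series", [])}
        for query, response in query_results
    ]

    system_prompt = (
        "You are an SRE assistant doing capacity planning.\n"
        "The user message is a JSON array; each element contains:\n"
        "  - query: the Datadog metrics query that was executed\n"
        "  - series: the timeseries data returned for that query\n"
        "Summarize utilization trends, flag capacity concerns, and format\n"
        "the answer with Slack-friendly mrkdwn."
    )
    return system_prompt, json.dumps(enriched, ensure_ascii=False)
```

This mirrors the two improvements above: the query travels alongside the data, and the system prompt explains what the context actually contains.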
Result
It gave us a summary of the dashboard for the specified period and pointed out potential concerns.
That said, the improvement suggestions were fairly generic. Still, the very first outputs were completely off the mark, so this was already a big step forward. (As I recall, most of the day was spent fighting with the prompts.)


Conclusion
In the end it wasn't good enough to use for real capacity planning, but even in the short time we had, I could feel the accuracy gradually improving as we iterated, so I'd like to take another run at it when I have time.
SRG is looking for people to work with us. If you are interested, please contact us here.