Datadog + Dify + Slack: Capacity Planning Made Easy

My name is Kataoka, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
*SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
This article was written at SRG TechFest, a hackathon event held within SRG. It introduces a PoC we created to make capacity planning easier.

What is SRG TechFest?


Before getting into the main topic, let me briefly explain what SRG TechFest is. It is a hackathon event that SRG holds every quarter: a theme is decided each time, and individuals and teams spend the entire day working on it.
This time, we worked in teams of 3-4 people under the theme of "Leveraging generative AI from an SRE perspective."
*SRG also holds various other activities, such as quarterly workshops (study sessions) and weekly volunteer reading groups.

Datadog + Dify + Slack


This time, our team wanted to make capacity planning easier, so we decided to leave it to AI.
SRG has a Dify environment that members can use freely, so we explored ways to integrate it with Datadog, which many of our services use. We also set the system up to send the resulting capacity-planning report to Slack.

What we built

The configuration is as follows:
Steps
  1. Specify the Datadog dashboard ID and time range
  2. Fetch the full dashboard definition with the Datadog API
  3. LLM: generate the related metrics queries from the API result
  4. Extract the queries as an array
  5. Iterate over the queries (calling the Datadog QueryMetrics API for each)
  6. Convert the data (array to string)
  7. LLM: create the capacity-planning report (in Slack-friendly notation)
  8. Analyze the iteration results
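The data-handling steps above (2, 4, and 6) can be sketched in Python. This is a minimal sketch, not the actual Dify workflow: the helper names are my own, and it assumes the Datadog v1 REST endpoints (`GET /api/v1/dashboard/{id}` for step 2, `GET /api/v1/query` for step 5) with key-based authentication.

```python
import json
import urllib.parse
import urllib.request

DD_SITE = "https://api.datadoghq.com"


def dd_get(path: str, params: dict, api_key: str, app_key: str) -> dict:
    """Step 2 / step 5: call a Datadog v1 REST endpoint with key auth."""
    url = f"{DD_SITE}{path}?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(url, headers={
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extract_queries(dashboard: dict) -> list[str]:
    """Step 4: pull every metrics query string out of a dashboard
    definition. Widgets can be nested inside group widgets, so we walk
    the tree recursively."""
    queries: list[str] = []

    def walk(widgets: list[dict]) -> None:
        for widget in widgets:
            defn = widget.get("definition", {})
            reqs = defn.get("requests")
            if isinstance(reqs, list):
                for r in reqs:
                    if r.get("q"):
                        queries.append(r["q"])
            walk(defn.get("widgets", []))

    walk(dashboard.get("widgets", []))
    return queries


def series_to_text(series: list[dict]) -> str:
    """Step 6: flatten QueryMetrics series into a plain-text block
    that can be pasted into the LLM prompt."""
    lines = []
    for s in series:
        points = ", ".join(
            f"{v:.2f}" for _, v in s.get("pointlist", []) if v is not None
        )
        lines.append(f"{s.get('metric')} ({s.get('scope')}): {points}")
    return "\n".join(lines)
```

In the real workflow, steps 3 and 7 are LLM nodes in Dify sitting between these helpers; only the plumbing around them is shown here.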

Difficulties

  • The "please output like this" formatting example was sometimes treated as actual data for answering the question (a bad example hurts surprisingly much)
  • The results changed frequently between prompt runs
  • For steps 2 and 7, we had to inspect what data was actually available and revise the prompt until the desired processing was performed

Improvement points

  • Variable substitution in the query does not replace information contained in the data; it is sent as-is, so detailed instructions need to be added
  • Giving the system prompt more information about the data in the context improves accuracy
  • The API response passed to the LLM did not include the query/metric information, which caused low accuracy, so we incorporated the query information into each API response
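For the last point, one way to carry the query information along is to stamp it onto the response before handing it to the LLM. This is a hypothetical helper of my own, not the original workflow's code, assuming the v1 QueryMetrics response shape (a `series` list with `metric` and `pointlist` fields):

```python
def annotate_with_query(query: str, response: dict) -> dict:
    """Copy the originating metrics query onto the response and each of
    its series, so the LLM can tell which query the numbers belong to.
    (Hypothetical helper; field names follow the v1 QueryMetrics schema.)"""
    response["query"] = query
    for series in response.get("series", []):
        series["source_query"] = query
    return response
```

With this, the text passed to the LLM in step 6 always pairs each set of data points with the query that produced them.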

Results

The AI produced a summary of the dashboard for the given period and pointed out areas of concern.
However, its suggestions for improvement were still only generic advice. That said, the very first outputs were completely irrelevant, so this was actually a big improvement. (Looking back, most of the time was a battle with the prompts.)

Conclusion


In the end, the output was not good enough to use for real capacity planning, but even in the short time available I could feel the accuracy gradually improving with each iteration, so I'd like to try again when I have time.
SRG is looking for people to work with us. If you're interested, please contact us here.