Recording slow queries and analyzing them using pt-query-digest

This is Onikai (@fat47) from the Service Reliability Group (SRG) of the Technology Division.
The Service Reliability Group (SRG) primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article is part of a series in which the SRG DB Working Group (DBWG) publishes database-related materials that it provides to the entire company.
I hope this is of some help.
 

What is a slow query?


Queries whose execution time exceeds the `long_query_time` value specified in the MySQL settings are written to the slow query log.
Generally, slow queries are ones that lack appropriate indexes or that read a large number of rows.
Because such queries are expensive, they place a heavy load on the database.
 

Recommended settings for slow query logs


💡
This document outlines the configuration steps for running MySQL on-premises or on EC2.
 

Enable log output

Set in my.cnf
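A minimal sketch of the relevant `my.cnf` entries (the log path and the 1-second threshold are example values, not prescriptions):

```ini
[mysqld]
# Enable the slow query log
slow_query_log = 1
# Where to write it (example path)
slow_query_log_file = /var/log/mysql/mysql-slow.log
# Log queries that take longer than 1 second (example threshold)
long_query_time = 1
```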
 
Alternatively, log in to MySQL and then execute the following:
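For example, with the same illustrative values (note that `SET GLOBAL` changes do not survive a server restart, so mirror them in `my.cnf` for persistence):

```sql
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';
SET GLOBAL long_query_time = 1;
```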
 

Set up rotation with logrotate

With the current settings, slow queries will keep being written to the same mysql-slow.log file.
To keep the log easy to analyze, we rotate it by date.
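A sketch of a logrotate configuration (the path, retention count, and credential handling are assumptions; `FLUSH SLOW LOGS` makes mysqld close and reopen the log file after rotation):

```shell
# /etc/logrotate.d/mysql-slow (example)
/var/log/mysql/mysql-slow.log {
    daily
    rotate 14
    missingok
    compress
    delaycompress
    dateext
    postrotate
        # Assumes credentials are available non-interactively, e.g. via ~/.my.cnf
        /usr/bin/mysql -e 'FLUSH SLOW LOGS;'
    endscript
}
```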
 
Run the test
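One simple check is to run a query that deliberately exceeds the threshold (assuming a 1-second `long_query_time`; the log path is illustrative):

```shell
# SLEEP(2) takes about 2 s, so it should exceed a 1-second threshold
mysql -e "SELECT SLEEP(2);"
# Confirm the query was recorded
sudo tail /var/log/mysql/mysql-slow.log
```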
 
Once the setup is complete, log files will accumulate, rotated by date.
 

How to analyze slow query logs


When you examine the contents of the recorded slow query log, you'll see that queries exceeding the threshold are recorded in this format.
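An entry looks roughly like the following (the timestamps, counts, and the query itself are made-up illustrative values):

```
# Time: 2024-01-15T10:23:45.678901Z
# User@Host: app[app] @ localhost []  Id:   123
# Query_time: 3.482160  Lock_time: 0.000120 Rows_sent: 1  Rows_examined: 4823190
SET timestamp=1705314225;
SELECT * FROM orders WHERE customer_id = 42;
```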
In actual logs, these entries can appear in the hundreds or even thousands, making it impossible to review them all.
 

Let's collect statistics on slow queries.

Let's use pt-query-digest, included in the Percona Toolkit, to collect statistics on slow logs.
 
Installation example on CentOS 7
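On CentOS 7, Percona Toolkit can be installed from the Percona yum repository (the repository RPM URL below is the one Percona documents; verify the current one before use):

```shell
sudo yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
sudo yum install -y percona-toolkit
```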
 
Let's try running it without any options.
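With no options, you simply point it at the slow log file (the path is illustrative):

```shell
pt-query-digest /var/log/mysql/mysql-slow.log
```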
 
Let's check the output.
At the top, log statistics such as Exec Time and Lock Time are summarized.
The next section shows a ranking of the queries with the longest response times and the highest call counts.
  • Response time: the total execution time of the query and its percentage of the whole slow log.
  • Calls: the number of times the query was executed.
  • R/Call: the average response time per execution (Response time / Calls).
  • V/M: the variance-to-mean ratio of the response time; larger values mean more variable execution times.
  • Item: a summary (fingerprint) of the query.
 
The following section records aggregated information for each query.
  • Query_time distribution: a histogram showing the distribution of the query's execution times.
  • Tables: statements for inspecting the tables referenced by the query.
  • EXPLAIN: a sample query that you can actually run through EXPLAIN.
In this log example, we can see that Query1 is an extremely slow query, with most responses taking between 100ms and 10 seconds.
 

Use the options of pt-query-digest to conduct the investigation.

As mentioned earlier, analysis is possible even with the default settings,
but the following options let you analyze more efficiently.
Here are some excerpts.
 
I want to search logs by specifying a date and time range.
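`--since` and `--until` restrict the analysis to a time range, for example:

```shell
pt-query-digest --since '2024-01-15 00:00:00' --until '2024-01-16 00:00:00' mysql-slow.log
```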
 
I want to sort the queries by total execution time.
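`--order-by` takes an attribute:aggregate pair; `Query_time:sum` (which is also the default) ranks queries by total execution time:

```shell
pt-query-digest --order-by Query_time:sum mysql-slow.log
```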
 
I want to sort by the number of lines read, from most to least.
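Sorting on the total number of rows examined uses the same option with a different attribute:

```shell
pt-query-digest --order-by Rows_examined:sum mysql-slow.log
```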
 
I want to display the tables in order of the number of slow queries.
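`--group-by tables` aggregates by table instead of by query fingerprint, so you can see which tables attract the most slow queries:

```shell
pt-query-digest --group-by tables mysql-slow.log
```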
 

Analyze the query with EXPLAIN.


Once you've analyzed the slow query logs and identified queries that look like they could be improved,
the next step is to analyze the queries themselves.
We plan to cover this in a separate article, so please stay tuned.
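As a quick preview, EXPLAIN shows a query's execution plan (the table and query here are hypothetical):

```sql
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Check the type, key, and rows columns to see whether an index is used
-- and roughly how many rows MySQL estimates it will examine.
```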
 

In conclusion


SRG is looking for new team members. If you are interested, please contact us here.