MySQL Monitoring Best Practices

This is Oniumi (@fat47) from the Service Reliability Group (SRG) of the Technology Headquarters.
#SRG(Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article is located in SRGDBWGWe will publish materials regarding the database provided by the DB Working Group for the entire company.
I hope this helps in some way.
 

Purpose of this document


This article summarizes the key points of monitoring when operating MySQL.
We will also explain how to obtain metrics and each item.
 

The value that should trigger the alert


Set up alert triggers so that you can be notified as soon as a problem occurs with MySQL.
Warning
 

MySQL process alive monitoring

This monitors whether the MySQL process itself is alive.
Enables detection of sudden process death due to OOM Killer or other reasons.
 

Monitoring MySQL process startup time

You can check the time since MySQL was started by using the following command:
If the time here is short, it means that MySQL has restarted.
 

Monitoring Replication Status

This monitors whether the slave is replicating normally.
Slave_SQL_Running
Seconds_Behind_Master
 

Monitoring the number of connections

Monitor the ratio of (current number of connections) / (maximum number of connections).
If you exceed the limit, you will get a MySQL connection error "Too many connections."
Set a generous upper limit
 

Monitoring remaining disk space

This involves monitoring the OS, but database data usage often continues to increase, so it is important to monitor it appropriately.
If disk usage reaches its limit, there will be limited action you can take, so it is best to set up an alert to let you know when usage exceeds about 70%.
 

Is read_only turned on on the slave?

This setting is to prevent updates such as UPDATE from being executed by mistake on the slave database.
In the case of MySQL root user, this read_only is ignored and UPDATE etc. is executed, so do not use root user for normal operation.
💡
Starting with MySQL 5.7, a setting called `super_read_only` has been added that also prohibits updates by root.
 

Metrics to look out for and how to obtain them


Resource metrics help you investigate the cause of failures
By comparing metrics at three points in time - before the failure, during the failure, and after the failure has been resolved -
This will make it easier to identify the cause.
 
MySQL resource metrics are often graphed based on the results of SHOW GLOBAL STATUS etc. Each item in SHOW GLOBAL STATUS has two values: the value at that time (instantaneous value) and the cumulative value from when MySQL started to the present (cumulative value).
Since it is not clearly stated which value each corresponds to, we recommend using a pre-defined solution to obtain it.
 

Number of slow logs

Queries that take longer than the specified threshold to execute are recorded.
 

Current number of threads

The number of threads MySQL is currently running.
💡
If this metric is significantly higher than normal, it is likely due to an application problem that is causing an abnormal number of connections, or the application is continually reconnecting because the database is slowing down for other reasons.
 

Number of queries issued per second

This can be achieved using the solution introduced in the following chapter.
Check how many SELECT and UPDATE requests are coming in per second.
💡
If these have jumped significantly compared to before, suspect that there may be more access than expected, or that the number of queries has increased due to an application bug. Conversely, if the number of queries has not increased but latency has worsened, suspect that there may be an increase in single, heavy queries.
 

Query Latency

Check the average latency of the query itself and the number of slow queries that exceed a threshold.
These can also be achieved with the following solutions:
If the number of heavy queries is increasing, determine whether the increase in the number of queries (number of accesses) is causing the database itself to become heavy and slow, or whether the increase in accesses is not that great and the queries themselves are expensive, causing the database to become heavy.
 

Replication Lag in Seconds

The number of seconds replication is delayed.
Check whether replication is consistently delayed or if it is intermittently delayed and then converges.
In the former case, it is possible that a very heavy update query was executed or that there is a deadlock or other reason.
In the latter case, if the number of queries is increasing, it is possible that the performance of the DB slave is not keeping up.
 

Metrics Acquisition Solution


In the solution below,
SHOW ENGINE INNODB
In addition, some information is obtained from tables in the performance_schema.

Percona Monitoring and Management(PMP)

Solutions developed by Percona
There is a PMM Server that stores information and a PMM Client that collects it.
A complete set of resource information is provided on the dashboard.

Resource monitoring in SaaS

Mackerel provides a plugin for collecting MySQL resources.
 
Datadog includes MySQL resource collection functionality in the standard Datadog agent package.
Get started with Datadog
The Datadog Agent can collect a number of metrics from your MySQL database, including (among other things): Query throughput Query performance, such as average query execution time and slow queries Connections, such as currently open connections, aborted connections, and errors Buffer pool metrics, such as InnoDB metrics You can also create your own metrics using custom SQL queries. Note: This integration is also compatible with MariaDB, as MariaDB is a "compatible product" of MySQL. MySQL checks are included in the Datadog Agent package; no additional installation is required on the MySQL server. Preparing MySQL On each MySQL server, create a database user for the Datadog Agent. The following steps grant the Agent permission to log in from any host using datadog@'%' . You can restrict the datadog user to only log in from the localhost by using datadog@'localhost' . For more information, see Adding MySQL Accounts, Assigning Privileges, and Deleting Accounts. For MySQL 8.0+, create a datadog user using the native password hashing method. Verify that the user was created successfully with the following command, replacing with the password you created above: The Agent requires some permissions to collect metrics. Grant only limited permissions to the user, as follows: For MySQL 8.0 and later, set max_user_connections as follows: Once enabled, you can collect metrics from the performance_schema database by granting additional permissions. Follow the steps below to configure this check for an Agent running on a host. For container environments, see the Docker, Kubernetes, or ECS sections. To configure this check for an Agent running on a host: Available in Agent version 6.0 and later Run the Agent status subcommand and look for mysql in the Checks section: mysql.performance.user_connections (gauge) The number of user connections.
 
If you are using AWS RDS (Aurora), enable Performance Insights.

Conclusion


SRG is looking for people to work with us. If you're interested, please contact us here.