MySQL Monitoring Best Practices

This is Oniumi (@fat47) from the Service Reliability Group (SRG) of the Technology Headquarters.
#SRG(Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, contributing to OSS, etc.
This article is located in SRGDBWGWe will publish materials regarding the database that the (DB Working Group) provides to the entire company.
I hope this helps in some way.
 

Purpose of this document


This article summarizes the key points of monitoring when operating MySQL.
We will also explain how to obtain metrics and each item.
 

The value that should trigger the alert


Set up alert triggers so that you can be notified as soon as a problem occurs with MySQL.
Warning
 

MySQL process alive monitoring

This checks whether the MySQL process itself is alive.
Enables detection of sudden process death due to OOM Killer or other reasons.
 

MySQL process startup time monitoring

You can check the time since MySQL was started by using the following command.
If the time here is short, it means that MySQL has been restarted.
 

Monitoring Replication Status

This is to monitor whether the slave is replicating normally.
Slave_SQL_Running
Seconds_Behind_Master
 

Monitoring the number of connections

Monitor the ratio (current number of connections) / (maximum number of connections).
If you exceed the limit, you will get a MySQL connection error "Too many connections."
Set a generous upper limit
 

Disk space monitoring

This involves monitoring the OS, but since data usage in databases often continues to increase, it is important to monitor it appropriately.
If disk usage reaches too close to its limit, the measures that can be taken will be limited, so it is a good idea to set up an alert to notify you when usage exceeds about 70%.
 

Is read_only turned on on the slave?

This setting is to prevent accidentally executing updates such as UPDATE on the slave database.
In the case of the MySQL root user, this read_only is ignored and UPDATE etc. will be executed, so do not use the root user for normal operations.
💡
From MySQL5.7 onwards, a setting called `super_read_only` has been added that also prohibits updates by root.
 

Metrics to look out for and how to get them


Resource metrics help you investigate the cause of failures
By comparing metrics at three points in time, before the outage occurred, during the outage, and after the outage was resolved,
This will make it easier to pinpoint the cause.
 
MySQL resource metrics are often graphed based on the results of SHOW GLOBAL STATUS, etc. There are two types of values for each item in SHOW GLOBAL STATUS: the value at that time (instantaneous value) and the cumulative value from when MySQL started to the present (cumulative value).
Since it is not clear which value each corresponds to, we recommend using a pre-prepared solution to obtain them.
 

Number of slow logs

Queries that take longer than the specified threshold to execute are logged.
 

Current number of threads

The number of threads that MySQL is currently running.
💡
If this metric is significantly higher than normal, it is likely due to an application problem that is causing an abnormal number of connections to be made, or the application is continually reconnecting because the database is slowing down for other reasons.
 

Queries per second

This can be achieved using the solution presented in the following chapter.
Check how many SELECTs and UPDATEs are being made per second.
💡
If these have jumped significantly compared to before, suspect that there is more access than expected, or that a bug in the application has increased the number of queries. Conversely, if the number of queries has not increased but latency has worsened, suspect the possibility that there is an increase in single, heavy queries.
 

Query Latency

Check the average latency of the queries themselves and the number of slow queries that exceed a threshold.
These can also be achieved with the following solutions:
If the number of heavy queries is increasing, determine whether an increase in the number of queries (number of accesses) is making the database itself heavier and slower, or whether accesses are not increasing that much and the queries themselves are expensive, causing the database to become heavy.
 

Replication Lag Seconds

The number of seconds replication is delayed.
Check whether replication is consistently delayed or if it is intermittently delayed and then converges.
In the former case, possible reasons include a very heavy update query being executed or a deadlock.
In the latter case, if the number of queries is increasing, it is possible that the DB slave's performance is not keeping up.
 

Metrics Acquisition Solution


In the following solution:
SHOW ENGINE INNODB
Some information is also obtained from tables in the performance_schema.

Percona Monitoring and Management(PMP)

Solutions developed by Percona
There is the PMM Server that stores information and the PMM Client that collects it.
A range of resource information is provided on the dashboard.

SaaS Resource Monitoring

Mackerel provides a plugin for collecting MySQL resources.
 
Datadog includes MySQL resource collection functionality in the standard Datadog agent package.
Get started with Datadog
The Datadog Agent can collect a number of metrics from your MySQL databases, including (among others): Query throughput Query performance, such as average query execution time and slow queries Connections, such as currently open connections, aborted connections, and errors InnoDB, such as buffer pool metrics You can also create your own metrics using custom SQL queries. Note: This integration is also compatible with MariaDB, as MariaDB is a "compatible product" of MySQL. MySQL checks are included in the Datadog Agent package; no additional installation is required on your MySQL servers. Preparing MySQL On each MySQL server, create a database user for the Datadog Agent. The following steps grant the Agent permission to log in from any host using datadog@'%' . You can restrict the datadog user to only log in from localhost by using datadog@'localhost' . For more information, see Adding MySQL Accounts, Assigning Privileges, and Deleting Accounts. For mySQL 8.0+, create a datadog user using the native password hashing method. Verify that the user was created successfully with the following command, replacing with the password you created above: The Agent requires some permissions to collect metrics. Grant only limited permissions to the user, as follows: For MySQL 8.0 and later, set max_user_connections as follows: Once enabled, you can collect metrics from the performance_schema database by granting additional permissions. Follow the steps below to configure this check for an Agent running on a host. For container environments, see the Docker, Kubernetes, or ECS sections. To configure this check for an Agent running on a host: Agent Available in version 6.0 and later Run the Agent status subcommand and look for mysql in the Checks section. mysql.performance.user_connections (gauge) The number of user connections.
 
If you use AWS RDS (Aurora), enable Performance Insights.

Conclusion


SRG is looking for people to work with us. If you are interested, please contact us here.