MySQL Monitoring Best Practices

This is Onikai (@fat47) from the Service Reliability Group (SRG) of the Technology Division.
#SRGThe Service Reliability Group primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article is located within SRG.DBWGThe DB Working Group is releasing database-related materials that it provides to the entire company.
I hope this is of some help.
 

The purpose of this document


This document summarizes key points for monitoring MySQL during its operation.
This section explains how to obtain metrics and describes each of the metrics items.
 

Values ​​to use as alert triggers


Set up alert triggers so you can be notified as soon as a problem occurs in MySQL.
Warning
 

MySQL process health monitoring

This monitors whether the MySQL process itself is still running.
This allows for detection of sudden process deaths due to reasons such as OOM Killer.
 

Monitoring MySQL process startup time

You can use the following command to check the time elapsed since MySQL started.
The short duration here indicates that a MySQL restart occurred.
 

Monitoring replication status

This monitors whether the slave has successfully established replication.
Slave_SQL_Running
Seconds_Behind_Master
 

Monitoring the number of connections

Monitor the ratio of (current number of connections) ÷ (maximum number of connections).
If the limit is exceeded, a MySQL connection error, "Too many connections," will occur.
Let's set a generous upper limit.
 

Monitoring remaining disk space

While this involves monitoring the OS, databases often experience continuous data usage increases, so it's important to monitor them appropriately.
If disk usage gets too close to the limit, the available solutions become limited, so it's best to set up an alert to notify you when usage exceeds around 70%.
 

Is read_only enabled on the slave device?

This setting prevents accidental updates such as UPDATE statements from being executed on the slave database.
If you are using the MySQL root user, this `read_only` directive will be ignored and UPDATE statements and other operations will be executed, so you should avoid using the root user for normal operational tasks.
💡
MySQL 5.7 and later versions also include a setting called `super_read_only` which prohibits updates even by root.
 

Key metrics and how to obtain them


Resource metrics are useful for investigating the root cause of failures.
By comparing metrics at three points in time—before the failure, during the failure, and after the failure is resolved—
This makes it easier to identify the cause.
 
MySQL resource metrics are often graphed based on results from commands like SHOW GLOBAL STATUS. Each item in SHOW GLOBAL STATUS has two types of values: the value at a specific point in time (instantaneous value) and the cumulative value from when MySQL was started until the present (cumulative value).
Since it's not explicitly stated which value each one represents, we recommend using a pre-prepared solution to obtain them.
 

Slow log entries

Queries that took longer than the specified execution threshold are recorded.
 

Current number of threads

This is the number of threads currently running in MySQL.
💡
If this metric is significantly higher than normal, it's often due to an application malfunction causing an abnormal number of connections, or because the database is slowing down for other reasons, causing the application to continuously reconnect.
 

Number of queries issued per second

This can be obtained using the solutions described in the following chapter.
We will check how many SELECT and UPDATE operations are coming in per second.
💡
If these figures have jumped significantly compared to before, suspect either that there is more traffic than expected or that the number of queries is increasing due to an application bug. Conversely, if the number of queries hasn't increased but latency has worsened, suspect that there may be an increase in single, heavy queries.
 

Query latency

This is checked by the average latency of the queries themselves and the number of slow queries that exceed a threshold.
These can also be obtained using the solutions described below.
If you're experiencing an increase in heavy queries, determine whether the database itself is slowing down due to an increase in the number of queries (accesses), or whether the database is slowing down because the queries themselves are costly and not significantly increasing access.
 

Replication delay in seconds

This value indicates the replication delay in seconds.
We will check whether replication is continuously delayed or if it is delayed intermittently and then converges repeatedly.
In the former case, it could be due to executing a very heavy update query or a deadlock.
In the latter case, if the number of queries has increased, it is possible that the DB slave's performance is no longer keeping up.
 

Metrics acquisition solution


In the following solution,
SHOW ENGINE INNODB
Additionally, some of the information is retrieved from the `performance_schema` table.

Percona Monitoring and Management(PMP)

A solution developed by Percona Inc.
There is a PMM Server that stores information and a PMM Client that collects it.
A complete set of resource information is available on the dashboard.

Resource Monitoring in SaaS

Mackerel provides a plugin for collecting MySQL resources.
 
Datadog includes MySQL resource collection functionality in its standard datadog agent package.
Let's get started with Datadog.
The Datadog Agent can collect a number of metrics from your MySQL database, including (examples): Query throughput Query performance such as average query execution time and slow queries Connections such as currently open connections, suspended connections, and errors InnoDB metrics such as buffer pool metrics You can also create your own metrics using custom SQL queries. Note: MariaDB is a "compatible product" of MySQL, so this integration is also compatible with MariaDB. MySQL checks are included in the Datadog Agent package. No additional installation is required on your MySQL server. Preparing MySQL On each MySQL server, create a database user for the Datadog Agent. The following steps grant the Agent permission to log in from any host using datadog@'%'. You can restrict the datadog user to log in only from localhost using datadog@'localhost'. For more information, see Adding, assigning privileges to, and deleting MySQL accounts. For mySQL 8.0+, create the datadog user using the native password hashing method. Verify that the user was created correctly using the following command: Replace with the password you created above. The Agent requires certain permissions to collect metrics. Grant users only limited permissions, as follows: For MySQL 8.0 and later, set max_user_connections as follows. Once enabled, you can collect metrics from the performance_schema database by granting additional permissions. To configure this check for an Agent running on a host, follow these steps: For container environments, see the Docker, Kubernetes, or ECS sections. To configure this check for an Agent running on a host: Available in Agent version 6.0 and later. Run the Agent's status subcommand and find mysql in the Checks section. mysql.performance.user_connections (gauge) The number of user connections.
 
If you are using AWS RDS (Aurora), be sure to enable Performance Insights.

In conclusion


SRG is looking for new team members. If you are interested, please contact us here.