MySQL Monitoring Best Practices
This is Onikai (@fat47) from the Service Reliability Group (SRG) of the Technology Division.
#SRGThe Service Reliability Group primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article is located within SRG.DBWGThe DB Working Group is releasing database-related materials that it provides to the entire company.
I hope this is of some help.
The purpose of this documentValues to use as alert triggersMySQL process health monitoringMonitoring MySQL process startup timeMonitoring replication statusMonitoring the number of connectionsMonitoring remaining disk spaceIs read_only enabled on the slave device?Key metrics and how to obtain themSlow log entriesCurrent number of threadsNumber of queries issued per secondQuery latencyReplication delay in secondsMetrics acquisition solutionPercona Monitoring and Management(PMP)Resource Monitoring in SaaSIn conclusion
The purpose of this document
This document summarizes key points for monitoring MySQL during its operation.
This section explains how to obtain metrics and describes each of the metrics items.
Values to use as alert triggers
Set up alert triggers so you can be notified as soon as a problem occurs in MySQL.
WarningMySQL process health monitoring
This monitors whether the MySQL process itself is still running.
This allows for detection of sudden process deaths due to reasons such as OOM Killer.
Monitoring MySQL process startup time
You can use the following command to check the time elapsed since MySQL started.
The short duration here indicates that a MySQL restart occurred.
Monitoring replication status
This monitors whether the slave has successfully established replication.
Slave_SQL_RunningSeconds_Behind_MasterMonitoring the number of connections
Monitor the ratio of (current number of connections) ÷ (maximum number of connections).
If the limit is exceeded, a MySQL connection error, "Too many connections," will occur.
Let's set a generous upper limit.
Monitoring remaining disk space
While this involves monitoring the OS, databases often experience continuous data usage increases, so it's important to monitor them appropriately.
If disk usage gets too close to the limit, the available solutions become limited, so it's best to set up an alert to notify you when usage exceeds around 70%.
Is read_only enabled on the slave device?
This setting prevents accidental updates such as UPDATE statements from being executed on the slave database.
If you are using the MySQL root user, this `read_only` directive will be ignored and UPDATE statements and other operations will be executed, so you should avoid using the root user for normal operational tasks.
MySQL 5.7 and later versions also include a setting called `super_read_only` which prohibits updates even by root.
Key metrics and how to obtain them
Resource metrics are useful for investigating the root cause of failures.
By comparing metrics at three points in time—before the failure, during the failure, and after the failure is resolved—
This makes it easier to identify the cause.
MySQL resource metrics are often graphed based on results from commands like SHOW GLOBAL STATUS. Each item in SHOW GLOBAL STATUS has two types of values: the value at a specific point in time (instantaneous value) and the cumulative value from when MySQL was started until the present (cumulative value).
Since it's not explicitly stated which value each one represents, we recommend using a pre-prepared solution to obtain them.
Slow log entries
Queries that took longer than the specified execution threshold are recorded.
Current number of threads
This is the number of threads currently running in MySQL.
If this metric is significantly higher than normal, it's often due to an application malfunction causing an abnormal number of connections, or because the database is slowing down for other reasons, causing the application to continuously reconnect.
Number of queries issued per second
This can be obtained using the solutions described in the following chapter.
We will check how many SELECT and UPDATE operations are coming in per second.
If these figures have jumped significantly compared to before, suspect either that there is more traffic than expected or that the number of queries is increasing due to an application bug.
Conversely, if the number of queries hasn't increased but latency has worsened, suspect that there may be an increase in single, heavy queries.
Query latency
This is checked by the average latency of the queries themselves and the number of slow queries that exceed a threshold.
These can also be obtained using the solutions described below.
If you're experiencing an increase in heavy queries, determine whether the database itself is slowing down due to an increase in the number of queries (accesses), or whether the database is slowing down because the queries themselves are costly and not significantly increasing access.
Replication delay in seconds
This value indicates the replication delay in seconds.
We will check whether replication is continuously delayed or if it is delayed intermittently and then converges repeatedly.
In the former case, it could be due to executing a very heavy update query or a deadlock.
In the latter case, if the number of queries has increased, it is possible that the DB slave's performance is no longer keeping up.
Metrics acquisition solution
In the following solution,
SHOW ENGINE INNODBAdditionally, some of the information is retrieved from the `performance_schema` table.
Percona Monitoring and Management(PMP)
A solution developed by Percona Inc.
There is a PMM Server that stores information and a PMM Client that collects it.
A complete set of resource information is available on the dashboard.
Resource Monitoring in SaaS
Mackerel provides a plugin for collecting MySQL resources.
Datadog includes MySQL resource collection functionality in its standard datadog agent package.
If you are using AWS RDS (Aurora), be sure to enable Performance Insights.
In conclusion
SRG is looking for new team members.
If you are interested, please contact us here.
