MySQL Monitoring Best Practices
This is Oniumi (@fat47) from the Service Reliability Group (SRG) of the Technology Headquarters.
#SRG(Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article is located in SRGDBWGWe will publish materials regarding the database provided by the DB Working Group for the entire company.
I hope this helps in some way.
Purpose of this documentThe value that should trigger the alertMySQL process alive monitoringMonitoring MySQL process startup timeMonitoring Replication StatusMonitoring the number of connectionsMonitoring remaining disk spaceIs read_only turned on on the slave?Metrics to look out for and how to obtain themNumber of slow logsCurrent number of threadsNumber of queries issued per secondQuery LatencyReplication Lag in SecondsMetrics Acquisition SolutionPercona Monitoring and Management(PMP)SaaS resource monitoringConclusion
Purpose of this document
This article summarizes the key points of monitoring when operating MySQL.
We will also explain how to obtain metrics and each item.
The value that should trigger the alert
Set up alert triggers so that you can be notified as soon as a problem occurs with MySQL.
Warning
MySQL process alive monitoring
This monitors whether the MySQL process itself is alive.
Enables detection of sudden process death due to OOM Killer or other reasons.
Monitoring MySQL process startup time
You can check the time since MySQL was started by using the following command:
If the time here is short, it means that MySQL has restarted.
Monitoring Replication Status
This monitors whether the slave is replicating normally.
Slave_SQL_Running
Seconds_Behind_Master
Monitoring the number of connections
Monitor the ratio of (current number of connections) / (maximum number of connections).
If you exceed the limit, you will get a MySQL connection error "Too many connections."
Set a generous upper limit
Monitoring remaining disk space
This involves monitoring the OS, but database data usage often continues to increase, so it is important to monitor it appropriately.
If disk usage reaches its limit, there will be limited action you can take, so it is best to set up an alert to let you know when usage exceeds about 70%.
Is read_only turned on on the slave?
This setting is to prevent updates such as UPDATE from being executed by mistake on the slave database.
In the case of MySQL root user, this read_only is ignored and UPDATE etc. is executed, so do not use root user for normal operation.
Starting with MySQL 5.7, a setting called `super_read_only` has been added that also prohibits updates by root.
Metrics to look out for and how to obtain them
Resource metrics help you investigate the cause of failures
By comparing metrics at three points in time - before the failure, during the failure, and after the failure has been resolved -
This will make it easier to identify the cause.
MySQL resource metrics are often graphed based on the results of SHOW GLOBAL STATUS etc. Each item in SHOW GLOBAL STATUS has two values: the value at that time (instantaneous value) and the cumulative value from when MySQL started to the present (cumulative value).
Since it is not clearly stated which value each corresponds to, we recommend using a pre-defined solution to obtain it.
Number of slow logs
Queries that take longer than the specified threshold to execute are recorded.
Current number of threads
The number of threads MySQL is currently running.
If this metric is significantly higher than normal, it is likely due to an application problem that is causing an abnormal number of connections, or the application is continually reconnecting because the database is slowing down for other reasons.
Number of queries issued per second
This can be achieved using the solution introduced in the following chapter.
Check how many SELECT and UPDATE requests are coming in per second.
If these have jumped significantly compared to before, suspect that there may be more access than expected,
or that the number of queries has increased due to an application bug.
Conversely, if the number of queries has not increased but latency has worsened, suspect that there may be an increase in single, heavy queries.
Query Latency
Check the average latency of the query itself and the number of slow queries that exceed a threshold.
These can also be achieved with the following solutions:
If the number of heavy queries is increasing, determine whether the increase in the number of queries (number of accesses) is causing the database itself to become heavy and slow, or whether the increase in accesses is not that great and the queries themselves are expensive, causing the database to become heavy.
Replication Lag in Seconds
The number of seconds replication is delayed.
Check whether replication is consistently delayed or if it is intermittently delayed and then converges.
In the former case, it is possible that a very heavy update query was executed or that there is a deadlock or other reason.
In the latter case, if the number of queries is increasing, it is possible that the performance of the DB slave is not keeping up.
Metrics Acquisition Solution
In the solution below,
SHOW ENGINE INNODB
In addition, some information is obtained from tables in the performance_schema.
Percona Monitoring and Management(PMP)
Solutions developed by Percona
There is a PMM Server that stores information and a PMM Client that collects it.
A complete set of resource information is provided on the dashboard.
Resource monitoring in SaaS
Mackerel provides a plugin for collecting MySQL resources.
Datadog includes MySQL resource collection functionality in the standard Datadog agent package.
If you are using AWS RDS (Aurora), enable Performance Insights.
Conclusion
SRG is looking for people to work with us.
If you're interested, please contact us here.