MySQL Monitoring Best Practices
This is Oniumi (@fat47) from the Service Reliability Group (SRG) of the Technology Headquarters.
#SRG(Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, contributing to OSS, etc.
This article is located in SRGDBWGWe will publish materials regarding the database that the (DB Working Group) provides to the entire company.
I hope this helps in some way.
Purpose of this documentThe value that should trigger the alertMySQL process alive monitoringMySQL process startup time monitoringMonitoring Replication StatusMonitoring the number of connectionsDisk space monitoringIs read_only turned on on the slave?Metrics to look out for and how to get themNumber of slow logsCurrent number of threadsQueries per secondQuery LatencyReplication Lag SecondsMetrics Acquisition SolutionPercona Monitoring and Management(PMP)SaaS Resource MonitoringConclusion
Purpose of this document
This article summarizes the key points of monitoring when operating MySQL.
We will also explain how to obtain metrics and each item.
The value that should trigger the alert
Set up alert triggers so that you can be notified as soon as a problem occurs with MySQL.
Warning
MySQL process alive monitoring
This checks whether the MySQL process itself is alive.
Enables detection of sudden process death due to OOM Killer or other reasons.
MySQL process startup time monitoring
You can check the time since MySQL was started by using the following command.
If the time here is short, it means that MySQL has been restarted.
Monitoring Replication Status
This is to monitor whether the slave is replicating normally.
Slave_SQL_Running
Seconds_Behind_Master
Monitoring the number of connections
Monitor the ratio (current number of connections) / (maximum number of connections).
If you exceed the limit, you will get a MySQL connection error "Too many connections."
Set a generous upper limit
Disk space monitoring
This involves monitoring the OS, but since data usage in databases often continues to increase, it is important to monitor it appropriately.
If disk usage reaches too close to its limit, the measures that can be taken will be limited, so it is a good idea to set up an alert to notify you when usage exceeds about 70%.
Is read_only turned on on the slave?
This setting is to prevent accidentally executing updates such as UPDATE on the slave database.
In the case of the MySQL root user, this read_only is ignored and UPDATE etc. will be executed, so do not use the root user for normal operations.
From MySQL5.7 onwards, a setting called `super_read_only` has been added that also prohibits updates by root.
Metrics to look out for and how to get them
Resource metrics help you investigate the cause of failures
By comparing metrics at three points in time, before the outage occurred, during the outage, and after the outage was resolved,
This will make it easier to pinpoint the cause.
MySQL resource metrics are often graphed based on the results of SHOW GLOBAL STATUS, etc. There are two types of values for each item in SHOW GLOBAL STATUS: the value at that time (instantaneous value) and the cumulative value from when MySQL started to the present (cumulative value).
Since it is not clear which value each corresponds to, we recommend using a pre-prepared solution to obtain them.
Number of slow logs
Queries that take longer than the specified threshold to execute are logged.
Current number of threads
The number of threads that MySQL is currently running.
If this metric is significantly higher than normal, it is likely due to an application problem that is causing an abnormal number of connections to be made, or the application is continually reconnecting because the database is slowing down for other reasons.
Queries per second
This can be achieved using the solution presented in the following chapter.
Check how many SELECTs and UPDATEs are being made per second.
If these have jumped significantly compared to before, suspect that there is more access than expected,
or that a bug in the application has increased the number of queries.
Conversely, if the number of queries has not increased but latency has worsened, suspect the possibility that there is an increase in single, heavy queries.
Query Latency
Check the average latency of the queries themselves and the number of slow queries that exceed a threshold.
These can also be achieved with the following solutions:
If the number of heavy queries is increasing, determine whether an increase in the number of queries (number of accesses) is making the database itself heavier and slower, or whether accesses are not increasing that much and the queries themselves are expensive, causing the database to become heavy.
Replication Lag Seconds
The number of seconds replication is delayed.
Check whether replication is consistently delayed or if it is intermittently delayed and then converges.
In the former case, possible reasons include a very heavy update query being executed or a deadlock.
In the latter case, if the number of queries is increasing, it is possible that the DB slave's performance is not keeping up.
Metrics Acquisition Solution
In the following solution:
SHOW ENGINE INNODB
Some information is also obtained from tables in the performance_schema.
Percona Monitoring and Management(PMP)
Solutions developed by Percona
There is the PMM Server that stores information and the PMM Client that collects it.
A range of resource information is provided on the dashboard.
SaaS Resource Monitoring
Mackerel provides a plugin for collecting MySQL resources.
Datadog includes MySQL resource collection functionality in the standard Datadog agent package.
If you use AWS RDS (Aurora), enable Performance Insights.
Conclusion
SRG is looking for people to work with us. If you are interested, please contact us here.