⚠️⚠️⚠️ STG SRG Portal ⚠️⚠️⚠️/Technical article/MySQL Monitoring Best Practices

MySQL Monitoring Best Practices

2024/11/28 19:182024/11/28 19:18

This is Onikai (@fat47) from the Service Reliability Group (SRG) of the Technology Division.

#SRGThe Service Reliability Group primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).

This article is located within SRG.DBWGThe DB Working Group is releasing database-related materials that it provides to the entire company.

I hope this is of some help.

The purpose of this document Values to use as alert triggers MySQL process health monitoring Monitoring MySQL process startup time Monitoring replication status Monitoring the number of connections Monitoring remaining disk space Is read_only enabled on the slave device?Key metrics and how to obtain them Slow log entries Current number of threads Number of queries issued per second Query latency Replication delay in seconds Metrics acquisition solution Percona Monitoring and Management(PMP)Resource Monitoring in SaaS In conclusion

The purpose of this document

This document summarizes key points for monitoring MySQL during its operation.

This section explains how to obtain metrics and describes each of the metrics items.

Values to use as alert triggers

Set up alert triggers so you can be notified as soon as a problem occurs in MySQL.

Warning

MySQL process health monitoring

This monitors whether the MySQL process itself is still running.

This allows for detection of sudden process deaths due to reasons such as OOM Killer.

Monitoring MySQL process startup time

You can use the following command to check the time elapsed since MySQL started.

The short duration here indicates that a MySQL restart occurred.

Monitoring replication status

This monitors whether the slave has successfully established replication.

Slave_SQL_Running

Seconds_Behind_Master

Monitoring the number of connections

Monitor the ratio of (current number of connections) ÷ (maximum number of connections).

If the limit is exceeded, a MySQL connection error, "Too many connections," will occur.

Let's set a generous upper limit.

Monitoring remaining disk space

While this involves monitoring the OS, databases often experience continuous data usage increases, so it's important to monitor them appropriately.

If disk usage gets too close to the limit, the available solutions become limited, so it's best to set up an alert to notify you when usage exceeds around 70%.

Is read_only enabled on the slave device?

This setting prevents accidental updates such as UPDATE statements from being executed on the slave database.

If you are using the MySQL root user, this `read_only` directive will be ignored and UPDATE statements and other operations will be executed, so you should avoid using the root user for normal operational tasks.

💡

MySQL 5.7 and later versions also include a setting called `super_read_only` which prohibits updates even by root.

Key metrics and how to obtain them

Resource metrics are useful for investigating the root cause of failures.

By comparing metrics at three points in time—before the failure, during the failure, and after the failure is resolved—

This makes it easier to identify the cause.

MySQL resource metrics are often graphed based on results from commands like SHOW GLOBAL STATUS. Each item in SHOW GLOBAL STATUS has two types of values: the value at a specific point in time (instantaneous value) and the cumulative value from when MySQL was started until the present (cumulative value).

Since it's not explicitly stated which value each one represents, we recommend using a pre-prepared solution to obtain them.

Slow log entries

Queries that took longer than the specified execution threshold are recorded.

Current number of threads

This is the number of threads currently running in MySQL.

💡

If this metric is significantly higher than normal, it's often due to an application malfunction causing an abnormal number of connections, or because the database is slowing down for other reasons, causing the application to continuously reconnect.

Number of queries issued per second

This can be obtained using the solutions described in the following chapter.

We will check how many SELECT and UPDATE operations are coming in per second.

💡

If these figures have jumped significantly compared to before, suspect either that there is more traffic than expected or that the number of queries is increasing due to an application bug. Conversely, if the number of queries hasn't increased but latency has worsened, suspect that there may be an increase in single, heavy queries.

Query latency

This is checked by the average latency of the queries themselves and the number of slow queries that exceed a threshold.

These can also be obtained using the solutions described below.

If you're experiencing an increase in heavy queries, determine whether the database itself is slowing down due to an increase in the number of queries (accesses), or whether the database is slowing down because the queries themselves are costly and not significantly increasing access.