HOME/Articles/MySQL Monitoring Best Practices

MySQL Monitoring Best Practices

2023/1/19 11:242024/9/20 9:04

This is Oniumi (@fat47) from the Service Reliability Group (SRG) of the Technology Headquarters.

#SRG(Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, contributing to OSS, etc.

This article is located in SRGDBWGWe will publish materials regarding the database that the (DB Working Group) provides to the entire company.

I hope this helps in some way.

Purpose of this document The value that should trigger the alert MySQL process alive monitoring MySQL process startup time monitoring Monitoring Replication Status Monitoring the number of connections Disk space monitoring Is read_only turned on on the slave?Metrics to look out for and how to get them Number of slow logs Current number of threads Queries per second Query Latency Replication Lag Seconds Metrics Acquisition Solution Percona Monitoring and Management(PMP)SaaS Resource Monitoring Conclusion

Purpose of this document

This article summarizes the key points of monitoring when operating MySQL.

We will also explain how to obtain metrics and each item.

The value that should trigger the alert

Set up alert triggers so that you can be notified as soon as a problem occurs with MySQL.

Warning

MySQL process alive monitoring

This checks whether the MySQL process itself is alive.

Enables detection of sudden process death due to OOM Killer or other reasons.

MySQL process startup time monitoring

You can check the time since MySQL was started by using the following command.

If the time here is short, it means that MySQL has been restarted.

Monitoring Replication Status

This is to monitor whether the slave is replicating normally.

Slave_SQL_Running

Seconds_Behind_Master

Monitoring the number of connections

Monitor the ratio (current number of connections) / (maximum number of connections).

If you exceed the limit, you will get a MySQL connection error "Too many connections."

Set a generous upper limit

Disk space monitoring

This involves monitoring the OS, but since data usage in databases often continues to increase, it is important to monitor it appropriately.

If disk usage reaches too close to its limit, the measures that can be taken will be limited, so it is a good idea to set up an alert to notify you when usage exceeds about 70%.

Is read_only turned on on the slave?

This setting is to prevent accidentally executing updates such as UPDATE on the slave database.

In the case of the MySQL root user, this read_only is ignored and UPDATE etc. will be executed, so do not use the root user for normal operations.

💡

From MySQL5.7 onwards, a setting called `super_read_only` has been added that also prohibits updates by root.

Metrics to look out for and how to get them

Resource metrics help you investigate the cause of failures

By comparing metrics at three points in time, before the outage occurred, during the outage, and after the outage was resolved,

This will make it easier to pinpoint the cause.

MySQL resource metrics are often graphed based on the results of SHOW GLOBAL STATUS, etc. There are two types of values for each item in SHOW GLOBAL STATUS: the value at that time (instantaneous value) and the cumulative value from when MySQL started to the present (cumulative value).

Since it is not clear which value each corresponds to, we recommend using a pre-prepared solution to obtain them.

Number of slow logs

Queries that take longer than the specified threshold to execute are logged.

Current number of threads

The number of threads that MySQL is currently running.

💡

If this metric is significantly higher than normal, it is likely due to an application problem that is causing an abnormal number of connections to be made, or the application is continually reconnecting because the database is slowing down for other reasons.

Queries per second

This can be achieved using the solution presented in the following chapter.

Check how many SELECTs and UPDATEs are being made per second.

💡

If these have jumped significantly compared to before, suspect that there is more access than expected, or that a bug in the application has increased the number of queries. Conversely, if the number of queries has not increased but latency has worsened, suspect the possibility that there is an increase in single, heavy queries.

Query Latency

Check the average latency of the queries themselves and the number of slow queries that exceed a threshold.

These can also be achieved with the following solutions:

If the number of heavy queries is increasing, determine whether an increase in the number of queries (number of accesses) is making the database itself heavier and slower, or whether accesses are not increasing that much and the queries themselves are expensive, causing the database to become heavy.

Replication Lag Seconds

The number of seconds replication is delayed.

Check whether replication is consistently delayed or if it is intermittently delayed and then converges.

In the former case, possible reasons include a very heavy update query being executed or a deadlock.

In the latter case, if the number of queries is increasing, it is possible that the DB slave's performance is not keeping up.

Metrics Acquisition Solution

In the following solution:

SHOW ENGINE INNODB

Some information is also obtained from tables in the performance_schema.

Percona Monitoring and Management(PMP)

Solutions developed by Percona

There is the PMM Server that stores information and the PMM Client that collects it.

A range of resource information is provided on the dashboard.

Percona Monitoring and Management

Monitor the health of your database infrastructure, explore new patterns in database behavior, and manage and improve the performance of your databases no matter where they are located or deployed.

https://www.percona.com/software/database-tools/percona-monitoring-and-management

SaaS Resource Monitoring

Mackerel provides a plugin for collecting MySQL resources.

mackerel-agent-plugins/mackerel-plugin-mysql at master · mackerelio/mackerel-agent-plugins

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or window. Reload to refresh your session. Reload to refresh your session.

https://github.com/mackerelio/mackerel-agent-plugins/tree/master/mackerel-plugin-mysql

Datadog includes MySQL resource collection functionality in the standard Datadog agent package.

Get started with Datadog

The Datadog Agent can collect a number of metrics from your MySQL databases, including (among others): Query throughput Query performance, such as average query execution time and slow queries Connections, such as currently open connections, aborted connections, and errors InnoDB, such as buffer pool metrics You can also create your own metrics using custom SQL queries. Note: This integration is also compatible with MariaDB, as MariaDB is a "compatible product" of MySQL. MySQL checks are included in the Datadog Agent package; no additional installation is required on your MySQL servers. Preparing MySQL On each MySQL server, create a database user for the Datadog Agent. The following steps grant the Agent permission to log in from any host using datadog@'%' . You can restrict the datadog user to only log in from localhost by using datadog@'localhost' . For more information, see Adding MySQL Accounts, Assigning Privileges, and Deleting Accounts. For mySQL 8.0+, create a datadog user using the native password hashing method. Verify that the user was created successfully with the following command, replacing with the password you created above: The Agent requires some permissions to collect metrics. Grant only limited permissions to the user, as follows: For MySQL 8.0 and later, set max_user_connections as follows: Once enabled, you can collect metrics from the performance_schema database by granting additional permissions. Follow the steps below to configure this check for an Agent running on a host. For container environments, see the Docker, Kubernetes, or ECS sections. To configure this check for an Agent running on a host: Agent Available in version 6.0 and later Run the Agent status subcommand and look for mysql in the Checks section. mysql.performance.user_connections (gauge) The number of user connections.

https://docs.datadoghq.com/ja/integrations/mysql/?tab=host

If you use AWS RDS (Aurora), enable Performance Insights.

Monitoring DB Load with Performance Insights on Amazon RDS

Monitoring DB Load with Performance Insights on Amazon RDS - Amazon Relational Database Service Performance Insights extends existing Amazon RDS monitoring capabilities to make database performance more visible and easier to analyze. You can use the Performance Insights dashboard to visualize database load on your Amazon RDS DB instances and filter by load waits, SQL statements, hosts, and users. For information about using Performance Insights with Amazon DocumentDB, see the Amazon DocumentDB Developer Guide. JavaScript is disabled or unavailable in your browser. To use the AWS documentation, you must enable JavaScript. For instructions, see the help pages for your browser.

https://docs.aws.amazon.com/ja_jp/AmazonRDS/latest/UserGuide/USER_PerfInsights.html

Conclusion

SRG is looking for people to work with us. If you are interested, please contact us here.

Recruitment information - CyberAgent SRG #ca_srg

About SRG SRG (Service Reliability Group) is working to improve reliability by promoting the introduction of SREs to the media business as a cross-sectional SRE under the vision of "improving reliability across the media business." The work is centered around the following three pillars: Consolidating and deploying the technical know-how of each business

https://ca-srg.dev/0d353a3a68ba4fb0a2e3443d7009a22b