Preventing problems with MongoDB Atlas and Cloud Manager! Summary of trouble cases and preventative measures
This is Kobayashi (@berlinbytes) from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-team support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
In this article, we will introduce potential problems and preventative measures, including case studies, based on knowledge gained from operating MongoDB to date.
- Introduction
- Learning from past troubles and lurking risks
- Version-related issues
- Backup troubles
- Misconfigured collections or indexes
- Specific preventative measures for stable operation
- First, follow the version
- Put the right alerts in place
- For MongoDB Atlas
- For MongoDB Cloud Manager
- List of alerts to set in MongoDB Atlas
- List of alerts to set in MongoDB Cloud Manager
- Metrics to capture using external tools such as Datadog
- Host units
- DB unit
- Replication (ReplicaSet) units
- Conclusion
Introduction
Our company has been using MongoDB since the early stages of social games.
We started by running it on physical servers, then moved to higher-density hosts, and later to servers equipped with high-speed storage such as SSDs and NVMe.
Currently, it is mainly used in virtual instances on private clouds and public clouds, as well as in managed services.
MongoDB is a NoSQL database that allows flexible data management using JSON-like documents.
With the introduction of management tools such as MongoDB Atlas and Cloud Manager, the points that need to be focused on during operations have changed.
Learning from past troubles and lurking risks
Most of the problems we've encountered so far may still occur today.
Excluding issues that only occur with certain versions, we've divided the main problems into three categories.
Version-related issues
There have been cases where MongoDB itself and the MongoDB Agent, which is essential for operation with Cloud Manager, have become outdated and are no longer supported.
The MongoDB Agent is a key component that provides monitoring, backup, and automation capabilities in a single binary.
Once support ends, you will need to manually upgrade your instances by connecting to them directly via SSH, which will increase your operational burden.
Backup troubles
Version issues also affect backups.
Backups may no longer be possible for unsupported (EOL) versions of MongoDB.
Additionally, mistakes made during manual operations can sometimes force a backup resync.
Misconfigured collections or indexes
Database design mistakes can seriously impact performance.
In particular, the following misconfigurations can cause abnormally high loads on the entire cluster or specific instances:
- Intensive access to non-sharded collections
  - Sharding is a horizontal partitioning technique that distributes data across multiple servers.
- Inefficient query execution without using indexes
  - A query is a request or command to a database.
  - If indexes are not properly configured, data searches take a long time and slow queries occur.
- Designs that cause data bias
  - For example, if a collection is sharded on a key taken from master data that holds only a small amount of data, the data will be biased towards a specific shard (distributed server), and the load will concentrate there.
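The data-bias case can be made concrete with a small sketch. It assumes you have already collected per-shard document counts for a collection (for example, by reading the output of `getShardDistribution()` in the shell). The helper name `shard_skew`, the input shape, and the sample numbers are hypothetical and only for illustration.

```python
def shard_skew(doc_counts):
    """Return (busiest_shard, ratio), where ratio = busiest share / even share.

    doc_counts: mapping of shard name -> document count (hypothetical input).
    A ratio near 1.0 means an even distribution; a ratio close to the number
    of shards means almost all data sits on a single shard.
    """
    total = sum(doc_counts.values())
    if total == 0:
        return None, 0.0
    busiest = max(doc_counts, key=doc_counts.get)
    even_share = total / len(doc_counts)
    return busiest, doc_counts[busiest] / even_share


# Illustrative numbers only: a poorly chosen shard key piles data onto one shard.
counts = {"shard0": 9_400, "shard1": 300, "shard2": 300}
busiest, ratio = shard_skew(counts)
print(busiest, round(ratio, 2))
```

A ratio well above 1.0 for one shard is a hint to revisit the shard key before the load concentrates further.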
These problems can be summarized into the following three points:
- Failures and their early symptoms may be overlooked because of inadequate monitoring and alerts.
- Late updates to the Agent and binary versions can leave you on unsupported versions and unable to use new features.
- Misconfiguration of collections or indexes can cause high loads and slow queries.
Specific preventative measures for stable operation
To prevent the above problems from occurring, it is important to take preventative measures on a daily basis.
First, follow the version
MongoDB currently releases major versions once a year.
It is not necessary to update immediately, but it is recommended to consider the EOL (End of Life) policy and periodically determine whether or not to update the version.
In particular, for patch releases that include security fixes and bug fixes, it is recommended that you make upgrading to the latest version your basic policy.
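As a rough sketch of this policy, the check below classifies a running version as needing a major upgrade, a patch upgrade, or nothing. The function names are hypothetical, and both the latest patch version and the set of EOL series are inputs you would maintain yourself from the official release and support information; the values used here are placeholders.

```python
def parse_version(version):
    """Turn '6.0.10' into (6, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgrade_advice(running, latest_patch, eol_series):
    """Classify a running version.

    running:      version reported by the server, e.g. '6.0.4'
    latest_patch: newest patch release of the same series (supplied by you)
    eol_series:   set of 'major.minor' series you treat as EOL (supplied by you)
    """
    major_minor = ".".join(running.split(".")[:2])
    if major_minor in eol_series:
        return "plan major upgrade"
    if parse_version(running) < parse_version(latest_patch):
        return "apply patch release"
    return "up to date"


# Placeholder values for illustration, not authoritative release data.
print(upgrade_advice("6.0.4", "6.0.9", {"4.4"}))
```

Comparing parsed tuples rather than raw strings matters here, because a plain string comparison would rank "6.0.10" below "6.0.9".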
Put the right alerts in place
Having an alert system in place is important for early detection of problems.
Here we will introduce some typical alerts that should be set in MongoDB Atlas and Cloud Manager.
For MongoDB Atlas
From the Atlas UI, set up alerts for the following conditions:
- Backup troubles
  - Snapshot failed
  - Snapshot schedule fell behind
- Misconfigured collections or indexes
  - Query Targeting: Scanned Objects / Returned
  - Host has index suggestions
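The Query Targeting condition compares how many documents were scanned with how many were returned. A comparable ratio can be checked for a single query from the output of an explain run in `executionStats` mode, which reports `totalDocsExamined` and `nReturned`. The sketch below is a per-query approximation, not how Atlas computes its metric; the helper name and the hand-written sample document are hypothetical.

```python
def targeting_ratio(explain_doc):
    """Documents examined per document returned for one explained query.

    Expects a dict shaped like explain output in 'executionStats' mode.
    Returns None when nothing was returned, since the ratio is undefined.
    """
    stats = explain_doc["executionStats"]
    returned = stats["nReturned"]
    if returned == 0:
        return None
    return stats["totalDocsExamined"] / returned


# Hand-written sample shaped like explain output (values are made up).
sample = {
    "executionStats": {
        "nReturned": 5,
        "totalDocsExamined": 50_000,
        "totalKeysExamined": 0,
    }
}
ratio = targeting_ratio(sample)
print(ratio)
```

A large ratio together with zero keys examined, as in this sample, usually indicates that the query is scanning the collection without a usable index.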
For MongoDB Cloud Manager
From the Cloud Manager UI, set up alerts for the following conditions:
- Version-related
  - Monitoring does not have the latest version
  - Host does not have the latest version
- Backup troubles
  - Backup does not have the latest version
  - Backup is down
  - Backup requires a resync
  - Backup oplog is behind
- Misconfigured collections or indexes
  - Host has index suggestions
List of alerts to set in MongoDB Atlas
In addition to the alerts mentioned above, we recommend the following settings as best practices:
- Atlas Auto Scaling
  - Compute auto-scaling initiated for base tier
  - Compute auto-scaling initiated for analytics tier
  - Disk auto-scaling initiated
- Backup
  - Snapshot failed
  - Snapshot schedule fell behind
- Billing
  - Credit card is about to expire
- Maintenance Window
  - Maintenance is scheduled
  - Maintenance started
- Host
  - Host has index suggestions
  - System: CPU (User) % above 95
  - Query Targeting: Scanned Objects / Returned above 1000
  - Disk space % used on Data Partition above 90
  - Connections % of configured limit above 80
- Limit
  - An overall request rate limit has been hit
- Replica Set
  - Replica set has no primary
  - Replication Oplog Window is below 1 hours
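If you also evaluate these conditions outside Atlas, for example in a script that reads metrics from another tool, the numeric thresholds above can be expressed as a small rule table. This is a minimal sketch: the thresholds come from the list above, while the metric key names and the sample values are hypothetical.

```python
import operator

# Thresholds taken from the list above; the metric keys are hypothetical names.
RULES = [
    ("cpu_user_percent", operator.gt, 95),
    ("query_targeting_ratio", operator.gt, 1000),
    ("disk_used_percent", operator.gt, 90),
    ("connections_percent", operator.gt, 80),
    ("oplog_window_hours", operator.lt, 1),
]


def breached(metrics):
    """Return the names of metrics that cross their threshold.

    Metrics missing from the input are skipped rather than treated as breaches.
    """
    return [
        name
        for name, compare, limit in RULES
        if name in metrics and compare(metrics[name], limit)
    ]


# Made-up sample readings for one host.
sample = {"cpu_user_percent": 97.0, "disk_used_percent": 72.0, "oplog_window_hours": 0.5}
print(breached(sample))
```

Keeping the rules in one table makes it easier to review the thresholds and to keep them in line with what is configured in the Atlas UI.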
List of alerts to set in MongoDB Cloud Manager
We recommend the following settings for your Cloud Manager environment:
- Agent
  - Monitoring is down
  - Backup is down
  - Monitoring does not have the latest version
  - Backup does not have the latest version
- Backup
  - Backup oplog is behind
  - Backup requires a resync
- Billing
  - Credit card is about to expire
- Host
  - Host is down
  - Host is exposed to the public Internet
  - Host is recovering
  - Host does not have the latest version
- Replica Set
  - Number of healthy members is...
Metrics to capture using external tools such as Datadog
In addition to the standard features of Atlas and Cloud Manager, you can use external monitoring tools such as Datadog and Percona Monitoring and Management (PMM) to enable more flexible alert settings and early detection of problems.
Here are some key metrics you should capture:
Host units
- CPU: Process CPU usage
- Memory: Memory usage
- Network I/O Usage: Network usage
- Disk Usage / I/O Usage: Disk usage and I/O volume
- Page Faults: Page fault occurrence status
DB unit
- Read Request (query, getmore): Increase or decrease in read requests
- Write Request (insert, update, delete): Increase or decrease in write requests
- Assertions: Number of assertion errors
Replication (ReplicaSet) units
- Connections: Number of connections
- Oplog size / Oplog window size: Operation log size and retention period
- Replication Lag: Replication delay time
- Read / Write tickets: Read/write waiting status
- Lock (Read / Write): Lock occurrence status
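As one example of how such a metric is derived, replication lag can be approximated from a document shaped like `replSetGetStatus` output by comparing each secondary's `optimeDate` with the primary's. The sketch below is a simplified illustration rather than what Datadog or PMM do internally; the function name, host names, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status):
    """Map each SECONDARY's name to its lag behind the PRIMARY, in seconds.

    status: dict shaped like replSetGetStatus output, using the 'name',
    'stateStr', and 'optimeDate' fields of each entry in 'members'.
    """
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary: lag is undefined, alert on that separately
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample; host names and timestamps are made up.
now = datetime(2024, 1, 1, 12, 0, 0)
sample = {"members": [
    {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=3)},
    {"name": "db3:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=45)},
]}
lag = replication_lag_seconds(sample)
print(lag)
```

Comparing the largest lag with the oplog window is useful in practice: a secondary whose lag approaches the window is at risk of needing a full resync.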
Conclusion
To ensure stable operation of MongoDB, it is extremely important to keep up with versions, maintain healthy backups, and set appropriate alerts.
In particular, version management and monitoring can prevent many problems before they occur.
Furthermore, by combining Atlas and Cloud Manager with external tools like Datadog, you can monitor more detailed metrics and resolve problems quickly, resulting in more efficient and sophisticated operations.
In a future article, we also hope to introduce some useful third-party tools, such as Datadog Database Monitoring for MongoDB and Percona Monitoring and Management.
If you are interested in SRG, please contact us here.