Preventing problems with MongoDB Atlas and Cloud Manager! Summary of trouble cases and preventative measures

This is Kobayashi (@berlinbytes) from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure behind our media services: improving existing services, launching new ones, and contributing to OSS.
In this article, we will introduce potential problems and preventative measures, including case studies, based on knowledge gained from operating MongoDB to date.
 

Introduction


Our company has been using MongoDB since the early stages of social games.
We started by running it on physical servers, then moved to higher-density hosts, and later to servers equipped with high-speed storage such as SSDs and NVMe.
Currently, it is mainly used in virtual instances on private clouds and public clouds, as well as in managed services.
MongoDB is a NoSQL database that allows flexible data management using JSON-like documents.
With the introduction of management tools such as MongoDB Atlas and Cloud Manager, the points that need to be focused on during operations have changed.

Learning from past troubles and lurking risks


Most of the problems we've encountered so far may still occur today. Excluding issues that only occur with certain versions, we've divided the main problems into three categories.

Version-related issues

There have been cases where MongoDB itself, or the MongoDB Agent that is essential for operating with Cloud Manager, became outdated and fell out of support.
The MongoDB Agent is a key component that provides monitoring, backup, and automation capabilities in a single binary.
Once support ends, you will need to manually upgrade your instances by connecting to them directly via SSH, which will increase your operational burden.

Backup troubles

Version issues also affect backups.
Backups may no longer be possible for unsupported (EOL) versions of MongoDB.
Additionally, mistakes in manual operations can sometimes force a backup resync.

Misconfigured collections or indexes

Database design mistakes can seriously impact performance.
In particular, the following misconfigurations can cause abnormally high loads on the entire cluster or specific instances:
  • Intensive access to non-sharded collections
    • Sharding is a horizontal partitioning technique that distributes data across multiple servers.
  • Inefficient query execution without using indexes
    • A query is a request or command to a database.
    • If indexes are not properly configured, data searches take a long time and slow queries occur.
  • Designs that cause data skew
    • For example, if you shard on a key taken from master data that holds only a small amount of data, the data will be skewed toward a specific shard (distributed server), and load will concentrate there.
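
One rough way to spot this kind of data bias is to compare how many documents each shard holds. The following is a minimal sketch, not an official method: the function name `shard_skew` and the input mapping (shard name to document count, gathered by whatever means you have) are assumptions made for illustration.

```python
from typing import Dict


def shard_skew(doc_counts: Dict[str, int]) -> float:
    """Return the largest shard's document count divided by an even share.

    1.0 means perfectly balanced. A value equal to the number of shards
    means every document sits on a single shard.
    """
    if not doc_counts:
        raise ValueError("doc_counts must contain at least one shard")
    total = sum(doc_counts.values())
    if total == 0:
        return 1.0  # an empty collection is trivially balanced
    even_share = total / len(doc_counts)
    return max(doc_counts.values()) / even_share


# Hypothetical counts: most documents ended up on one shard.
counts = {"shard-a": 9_000, "shard-b": 500, "shard-c": 500}
print(f"skew factor: {shard_skew(counts):.2f}")
```

A factor well above 1.0 suggests that the shard key is worth revisiting; what counts as "too high" depends on the workload.
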
These problems can be summarized into the following three points:
  • Failures and their early symptoms may be overlooked because monitoring and alerts are inadequate.
  • Delayed updates to the Agent and binary versions can leave you out of support and unable to use new features.
  • Misconfigured collections or indexes can cause high load and slow queries.
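
To check whether a particular query runs without an index, you can inspect its explain output for a `COLLSCAN` stage. The sketch below works on an explain document that has already been fetched as a Python dict. It assumes the unsharded output shape (`queryPlanner.winningPlan` with nested `inputStage` / `inputStages`); the function names and the sample documents are made up for illustration, and sharded clusters return a different shape that this sketch does not handle.

```python
from typing import Any, Dict, Iterator


def iter_stages(plan: Dict[str, Any]) -> Iterator[str]:
    """Yield every stage name in a (possibly nested) explain plan."""
    # Assumption: some newer servers wrap the plan tree under "queryPlan".
    plan = plan.get("queryPlan", plan)
    if "stage" in plan:
        yield plan["stage"]
    # A stage has either one child ("inputStage") or several ("inputStages").
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        yield from iter_stages(child)


def uses_collection_scan(explain: Dict[str, Any]) -> bool:
    """True when the winning plan contains a full collection scan."""
    winning = explain.get("queryPlanner", {}).get("winningPlan", {})
    return "COLLSCAN" in iter_stages(winning)


# Illustrative explain documents, trimmed to the fields used here.
indexed = {"queryPlanner": {"winningPlan": {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "user_id_1"},
}}}
unindexed = {"queryPlanner": {"winningPlan": {
    "stage": "COLLSCAN", "direction": "forward",
}}}

print(uses_collection_scan(indexed), uses_collection_scan(unindexed))
```
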

Specific preventative measures for stable operation


To prevent the above problems from occurring, it is important to take preventative measures on a daily basis.

First, keep up with versions

MongoDB currently releases major versions once a year.
You do not need to update immediately, but we recommend checking the EOL (End of Life) policy and periodically deciding whether to upgrade.
For patch releases in particular, which include security fixes and bug fixes, we recommend making an upgrade to the latest version your default policy.
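
A periodic check like this can be scripted. The following is a minimal sketch under stated assumptions: version strings are plain `X.Y.Z`, and the latest patch version and EOL date are supplied by you after looking them up. The function names are hypothetical, and the values in the example are placeholders, not real MongoDB release or EOL data.

```python
from datetime import date
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain 'X.Y.Z' version string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def patch_upgrade_available(running: str, latest_patch: str) -> bool:
    """True when both versions share a release series and latest_patch is newer."""
    old, new = parse_version(running), parse_version(latest_patch)
    return old[:2] == new[:2] and new[2] > old[2]


def days_until_eol(eol: date, today: date) -> int:
    """Days remaining until the EOL date (negative once it has passed)."""
    return (eol - today).days


# Placeholder inputs for illustration only.
print(patch_upgrade_available("6.0.4", "6.0.12"))
print(days_until_eol(date(2030, 1, 1), date(2029, 12, 2)))
```

Comparing versions as integer tuples avoids the string-comparison trap where "6.0.12" sorts before "6.0.4".
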

Put the right alerts in place

Having an alert system in place is important for early detection of problems.
Here we will introduce some typical alerts that should be set in MongoDB Atlas and Cloud Manager.

For MongoDB Atlas

From the Atlas UI, set up alerts for the following conditions:
  • Backup troubles
    • Snapshot failed
    • Snapshot schedule fell behind
  • Misconfigured collections or indexes
    • Query Targeting: Scanned Objects / Returned
    • Host has index suggestions

For MongoDB Cloud Manager

From the Cloud Manager UI, set up alerts for the following conditions:
  • Version-related
    • Monitoring does not have the latest version
    • Host does not have the latest version
  • Backup troubles
    • Backup does not have the latest version
    • Backup is down
    • Backup requires a resync
    • Backup oplog is behind
  • Misconfigured collections or indexes
    • Host has index suggestions

List of alerts to set in MongoDB Atlas


In addition to the alerts mentioned above, we recommend the following settings as best practices:
  • Atlas Auto Scaling
    • Compute auto-scaling initiated for base tier
    • Compute auto-scaling initiated for analytics tier
    • Disk auto-scaling initiated
  • Backup
    • Snapshot failed
    • Snapshot schedule fell behind
  • Billing
    • Credit card is about to expire
  • Maintenance Window
    • Maintenance is scheduled
    • Maintenance started
  • Host
    • Host has index suggestions
    • System: CPU (User) % above 95
    • Query Targeting: Scanned Objects / Returned above 1000
    • Disk space % used on Data Partition above 90
    • Connections % of configured limit above 80
  • Limit
    • An overall request rate limit has been hit
  • Replica Set
    • Replica set has no primary
    • Replication Oplog Window is below 1 hours

List of alerts to set in MongoDB Cloud Manager


We recommend the following settings for your Cloud Manager environment:
  • Agent
    • Monitoring is down
    • Backup is down
    • Monitoring does not have the latest version
    • Backup does not have the latest version
  • Backup
    • Backup oplog is behind
    • Backup requires a resync
  • Billing
    • Credit card is about to expire
  • Host
    • Host is down
    • Host is exposed to the public Internet
    • Host is recovering
    • Host does not have the latest version
  • Replica Set
    • Number of healthy members is...

Metrics to capture using external tools such as Datadog


In addition to the standard features of Atlas and Cloud Manager, you can use external monitoring tools such as Datadog and Percona Monitoring and Management (PMM) to enable more flexible alert settings and early detection of problems.
Here are some key metrics you should capture:

Host units

  • CPU: Process CPU usage
  • Memory: Memory usage
  • Network I/O Usage: Network usage
  • Disk Usage / I/O Usage: Disk usage and I/O volume
  • Page Faults: Page fault occurrence status

DB units

  • Read Request (query, getmore): Increase or decrease in read requests
  • Write Request (insert, update, delete): Increase or decrease in write requests
  • Assertions: Number of assertion errors
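
The read and write request metrics are typically derived from cumulative counters, so what you watch is the change between two samples. The sketch below assumes two snapshots of the `opcounters` section of `serverStatus`, already loaded as dicts, taken a known number of seconds apart. The function name and the sample numbers are made up for illustration.

```python
from typing import Dict

READ_OPS = ("query", "getmore")
WRITE_OPS = ("insert", "update", "delete")


def ops_per_second(
    earlier: Dict[str, int], later: Dict[str, int], seconds: float
) -> Dict[str, float]:
    """Average read and write requests per second between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    reads = sum(later[op] - earlier[op] for op in READ_OPS)
    writes = sum(later[op] - earlier[op] for op in WRITE_OPS)
    return {"reads": reads / seconds, "writes": writes / seconds}


# Illustrative samples taken 60 seconds apart.
sample_1 = {"insert": 1000, "query": 5000, "update": 2000,
            "delete": 100, "getmore": 300, "command": 9000}
sample_2 = {"insert": 1600, "query": 8000, "update": 2300,
            "delete": 100, "getmore": 600, "command": 12000}

print(ops_per_second(sample_1, sample_2, 60))
```

Because the counters restart from zero when the process restarts, a negative rate in this sketch signals a restart between samples rather than a real drop in traffic.
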

Replication (ReplicaSet) units

  • Connections: Number of connections
  • Oplog size / Oplog window size: Operation log size and retention period
  • Replication Lag: Replication delay time
  • Read / Write tickets: Read/write waiting status
  • Lock (Read / Write): Lock occurrence status
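
Replication lag and the oplog window are both simple time differences. The sketch below assumes the `members` array of a `replSetGetStatus` result (fields `name`, `stateStr`, `optimeDate`) and the timestamps of the oldest and newest oplog entries, all already loaded into Python. The function names, host names, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def replication_lag(members: List[Dict[str, Any]]) -> Dict[str, timedelta]:
    """How far each secondary's last applied operation trails the primary's."""
    primaries = [m for m in members if m["stateStr"] == "PRIMARY"]
    if not primaries:
        raise ValueError("replica set has no primary")
    primary_time = primaries[0]["optimeDate"]
    return {
        m["name"]: primary_time - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def oplog_window(first_entry: datetime, last_entry: datetime) -> timedelta:
    """Time span covered by the oplog, from oldest to newest entry."""
    return last_entry - first_entry


# Illustrative replica set status.
members = [
    {"name": "db1:27017", "stateStr": "PRIMARY",
     "optimeDate": datetime(2024, 1, 1, 12, 0, 10)},
    {"name": "db2:27017", "stateStr": "SECONDARY",
     "optimeDate": datetime(2024, 1, 1, 12, 0, 5)},
    {"name": "db3:27017", "stateStr": "SECONDARY",
     "optimeDate": datetime(2024, 1, 1, 12, 0, 10)},
]
print(replication_lag(members))

window = oplog_window(datetime(2024, 1, 1, 11, 30), datetime(2024, 1, 1, 12, 0, 10))
print(window < timedelta(hours=1))
```
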

Conclusion


To ensure stable operation of MongoDB, it is extremely important to keep up with versions, maintain healthy backups, and set appropriate alerts.
In particular, version management and monitoring can prevent many problems before they occur.
Furthermore, by combining the built-in features of Atlas and Cloud Manager with external tools like Datadog, you can monitor more detailed metrics and resolve problems quickly, resulting in more efficient and sophisticated operations.
We also hope to introduce some useful third-party tools, such as Datadog Database Monitoring for MongoDB and Percona Monitoring and Management.
 
If you are interested in SRG, please contact us here.