Preventing problems with MongoDB Atlas and Cloud Manager! Summary of trouble cases and preventative measures

2025/8/5 16:36 / 2025/8/5 17:11

This is Kobayashi (@berlinbytes) from the Service Reliability Group (SRG) of the Media Headquarters.

#SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.

In this article, we will introduce potential problems and preventative measures, including case studies, based on knowledge gained from operating MongoDB to date.

Introduction
Learning from past troubles and lurking risks
  Version-related issues
  Backup troubles
  Misconfigured collections or indexes
Specific preventative measures for stable operation
  First, follow the version
  Put the right alerts in place
    For MongoDB Atlas
    For MongoDB Cloud Manager
  List of alerts to set in MongoDB Atlas
  List of alerts to set in MongoDB Cloud Manager
  Metrics to capture using external tools such as Datadog
    Host units
    DB unit
    Replication (ReplicaSet) units
Conclusion

Introduction

Our company has been using MongoDB since the early stages of social games.

We started by running it on physical servers, moved to higher-density hosts, and then to servers equipped with high-speed storage such as SSD and NVMe drives.

Today we mainly run it on virtual instances in private and public clouds, as well as on managed services.

MongoDB is a NoSQL database that allows flexible data management using JSON-like documents.

With the introduction of management tools such as MongoDB Atlas and Cloud Manager, the points that need to be focused on during operations have changed.

Atlas Database

Find out how the document model eliminates operational complexity while ensuring unmatched resilience, scalability, and enterprise-grade security through the Atlas cloud database.

https://www.mongodb.com/products/platform/atlas-database

MongoDB Cloud Manager

Cloud Manager is the cloud-based management platform that enables you to deploy, monitor, back up, and scale MongoDB anywhere.

https://www.mongodb.com/products/tools/cloud-manager

Learning from past troubles and lurking risks

Most of the problems we've encountered so far may still occur today. Excluding issues that only occur with certain versions, we've divided the main problems into three categories.

Version-related issues

There have been cases where MongoDB itself, or the MongoDB Agent that is essential for operating with Cloud Manager, fell so far behind that it was no longer supported.

The MongoDB Agent is a key component that provides monitoring, backup, and automation capabilities in a single binary.

Once support ends, you will need to manually upgrade your instances by connecting to them directly via SSH, which will increase your operational burden.

Backup troubles

Version issues also affect backups.

Backups may no longer be possible for unsupported (EOL) versions of MongoDB.

Additionally, mistakes in manual operations can sometimes force a backup resync.

Misconfigured collections or indexes

Database design mistakes can seriously impact performance.

In particular, the following misconfigurations can cause abnormally high loads on the entire cluster or specific instances:

Intensive access to non-sharded collections

Sharding is a horizontal partitioning technique that distributes data across multiple servers.

Inefficient query execution without using indexes

A query is a request or command to a database.
If indexes are not properly configured, data searches take a long time and slow queries occur.

Designs that cause data bias

For example, if a collection is sharded on a key taken from master data that has only a few distinct values, the documents concentrate on a specific shard (one of the distributed servers), and the load concentrates there as well.
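The data bias described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `shard_for` helper, the key names, and the document counts are hypothetical, and a plain hash-modulo mapping stands in for MongoDB's actual chunk-based distribution.

```python
from collections import Counter
import hashlib


def shard_for(key_value: str, num_shards: int) -> int:
    """Map a shard-key value to a shard index with a stable hash.

    Simplified stand-in for real sharding: every document with the same
    key value always lands on the same shard.
    """
    digest = hashlib.md5(key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribution(key_values, num_shards: int) -> Counter:
    """Count how many documents land on each shard."""
    return Counter(shard_for(value, num_shards) for value in key_values)


# Hypothetical workload: 10,000 documents spread over 4 shards.
low_cardinality = [f"category-{i % 2}" for i in range(10_000)]  # 2 distinct key values
high_cardinality = [f"user-{i}" for i in range(10_000)]         # 10,000 distinct key values

low = distribution(low_cardinality, 4)
high = distribution(high_cardinality, 4)

print("low-cardinality key :", dict(low))
print("high-cardinality key:", dict(high))
```

With only two distinct key values, at most two of the four shards can ever receive documents, however many shards are added. The high-cardinality key spreads the same number of documents across all shards.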

These problems can be summarized into the following three points:

Failures and their early symptoms can be overlooked when monitoring and alerts are inadequate.

Falling behind on Agent and binary versions can leave you out of support and unable to use new features.

Misconfigured collections or indexes can cause high load and slow queries.

Specific preventative measures for stable operation

To prevent the above problems from occurring, it is important to take preventative measures on a daily basis.

First, follow the version

MongoDB currently releases major versions once a year.

You do not need to upgrade immediately, but we recommend reviewing the EOL (End of Life) policy periodically and deciding whether to upgrade.

For patch releases, which include security fixes and bug fixes, we recommend making "upgrade to the latest version" your basic policy.
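The policy above (stay within a supported series and keep up with patch releases) could be encoded as a periodic check along the following lines. This is a minimal sketch: the function names are hypothetical, the version numbers and the `SUPPORTED_SERIES` set are placeholders rather than the real EOL schedule, and versions are assumed to be plain `MAJOR.MINOR.PATCH` strings.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple that compares numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def release_series(version: str) -> str:
    """Return the 'MAJOR.MINOR' series of a version string."""
    major, minor, _ = parse_version(version)
    return f"{major}.{minor}"


def upgrade_advice(running: str, latest_patch: str, supported_series: set[str]) -> str:
    """Classify a running version as needing a major upgrade, a patch upgrade, or nothing."""
    if release_series(running) not in supported_series:
        return "major-upgrade"
    if release_series(latest_patch) != release_series(running):
        raise ValueError("latest_patch must belong to the same series as running")
    if parse_version(running) < parse_version(latest_patch):
        return "patch-upgrade"
    return "up-to-date"


# Placeholder values for illustration only; look up the real supported
# series in the vendor's EOL policy before relying on a check like this.
SUPPORTED_SERIES = {"7.0", "8.0"}

print(upgrade_advice("7.0.4", "7.0.12", SUPPORTED_SERIES))
print(upgrade_advice("5.0.9", "5.0.9", SUPPORTED_SERIES))
print(upgrade_advice("8.0.1", "8.0.1", SUPPORTED_SERIES))
```

Versions are compared as integer tuples rather than strings, because a string comparison would rank "7.0.4" above "7.0.12".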

MongoDB's own EOL policy

EOL Policy for MongoDB Cloud Manager

Put the right alerts in place

Having an alert system in place is important for early detection of problems.

Here we will introduce some typical alerts that should be set in MongoDB Atlas and Cloud Manager.

For MongoDB Atlas

From the Atlas UI, set up alerts for the following conditions:

Backup troubles

Snapshot failed
Snapshot schedule fell behind

Misconfigured collections or indexes

Query Targeting: Scanned Objects / Returned
Host has index suggestions
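To make the "Scanned Objects / Returned" idea concrete, the same kind of ratio can be computed for a single query from the `executionStats` section of an explain result, which reports `totalDocsExamined` and `nReturned`. The sketch below is an approximation of the concept, not a description of how Atlas derives its metric. The sample figures are hypothetical, and treating zero returned documents as one is a choice made for this sketch.

```python
def query_targeting_ratio(execution_stats: dict) -> float:
    """Return documents examined per document returned for one query."""
    examined = execution_stats["totalDocsExamined"]
    returned = execution_stats["nReturned"]
    # Treat zero returned documents as one to avoid division by zero
    # (an assumption of this sketch).
    return examined / max(returned, 1)


# Hypothetical explain("executionStats") excerpts.
collection_scan = {"totalDocsExamined": 250_000, "nReturned": 25}
index_backed = {"totalDocsExamined": 25, "nReturned": 25}

print(query_targeting_ratio(collection_scan))
print(query_targeting_ratio(index_backed))
```

A ratio close to 1 means the query reads roughly only the documents it returns. A very large ratio usually means a collection scan or a poorly selective index.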

For MongoDB Cloud Manager

From the Cloud Manager UI, set up alerts for the following conditions:

Version-related

Monitoring does not have the latest version
Host does not have the latest version

Backup troubles

Backup does not have the latest version
Backup is down
Backup requires a resync
Backup oplog is behind

Misconfigured collections or indexes

Host has index suggestions

List of alerts to set in MongoDB Atlas

In addition to the alerts mentioned above, we recommend the following settings as best practices:

Atlas Auto Scaling

Compute auto-scaling initiated for base tier
Compute auto-scaling initiated for analytics tier
Disk auto-scaling initiated

Backup

Snapshot failed
Snapshot schedule fell behind

Billing

Credit card is about to expire

Maintenance Window

Maintenance is scheduled
Maintenance started

Host

Host has index suggestions
System: CPU (User) % above 95
Query Targeting: Scanned Objects / Returned above 1000
Disk space % used on Data Partition above 90
Connections % of configured limit above 80

Limit

An overall request rate limit has been hit

Replica Set

Replica set has no primary
Replication Oplog Window is below 1 hours
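The Host thresholds in the list above can also be kept as data and evaluated against metrics collected elsewhere, which makes it easier to review them or reuse them in another tool. This is only a sketch: the metric names in `HOST_THRESHOLDS` are hypothetical labels rather than Atlas identifiers, and the sample values are made up.

```python
# Thresholds copied from the Host alert list; the keys are hypothetical names.
HOST_THRESHOLDS = {
    "cpu_user_percent": 95,
    "query_targeting_ratio": 1000,
    "disk_used_percent": 90,
    "connections_percent": 80,
}


def breached_thresholds(sample: dict) -> list[str]:
    """Return the names of metrics whose value is above its threshold.

    Metrics missing from the sample are skipped rather than treated as zero.
    """
    return [
        name
        for name, limit in HOST_THRESHOLDS.items()
        if name in sample and sample[name] > limit
    ]


# Hypothetical metric sample for one host.
sample = {
    "cpu_user_percent": 97.5,
    "query_targeting_ratio": 12.0,
    "disk_used_percent": 91.0,
    "connections_percent": 40.0,
}

print(breached_thresholds(sample))
```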

List of alerts to set in MongoDB Cloud Manager

We recommend the following settings for your Cloud Manager environment:

Agent

Monitoring is down
Backup is down
Monitoring does not have the latest version
Backup does not have the latest version

Backup

Backup oplog is behind
Backup requires a resync

Billing

Credit card is about to expire

Host

Host is down
Host is exposed to the public Internet
Host is recovering
Host does not have the latest version

Replica Set

Number of healthy members is...
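For reference, a healthy-member count can be derived from the `members` array returned by `replSetGetStatus`, where each member carries a `health` value (1 for reachable, 0 for not) and a `stateStr`. The sketch below makes several assumptions: which states count as healthy is a choice made here and may differ from Cloud Manager's own definition, the minimum is a parameter chosen by the operator, and the host names are hypothetical.

```python
# States counted as healthy in this sketch (an assumption, not Cloud Manager's rule).
HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def healthy_member_count(members: list[dict]) -> int:
    """Count members that are reachable and in a normal replica set state."""
    return sum(
        1
        for member in members
        if member.get("health") == 1 and member.get("stateStr") in HEALTHY_STATES
    )


def below_minimum(members: list[dict], minimum_healthy: int) -> bool:
    """Return True when fewer than minimum_healthy members are healthy."""
    return healthy_member_count(members) < minimum_healthy


# Hypothetical excerpt of a replSetGetStatus result.
members = [
    {"name": "db1.example.internal:27017", "health": 1, "stateStr": "PRIMARY"},
    {"name": "db2.example.internal:27017", "health": 1, "stateStr": "SECONDARY"},
    {"name": "db3.example.internal:27017", "health": 0, "stateStr": "(not reachable/healthy)"},
]

print(healthy_member_count(members))
print(below_minimum(members, minimum_healthy=3))
```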

Metrics to capture using external tools such as Datadog

In addition to the standard features of Atlas and Cloud Manager, you can use external monitoring tools such as Datadog and Percona Monitoring and Management (PMM) to enable more flexible alert settings and early detection of problems.

Here are some key metrics you should capture:

Host units

CPU: Process CPU usage

Memory: Memory usage

Network I/O Usage: Network usage

Disk Usage / I/O Usage: Disk usage and I/O volume

Page Faults: Page fault occurrence status

DB unit

Read Request (query, getmore): Increase or decrease in read requests

Write Request (insert, update, delete): Increase or decrease in write requests

Assertions: Number of assertion errors
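The read and write request metrics above are usually derived from cumulative counters, such as the `opcounters` section of `serverStatus`, by taking the difference between two samples. The following sketch assumes two such samples taken a known number of seconds apart. The counter values are hypothetical, and skipping counters that went backwards (for example after a restart) is a choice made for this sketch.

```python
def op_rates(previous: dict, current: dict, interval_seconds: float) -> dict:
    """Convert two cumulative opcounters samples into operations per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for op, value in current.items():
        if op not in previous:
            continue
        delta = value - previous[op]
        if delta < 0:
            # Counter went backwards (likely a restart); skip it for this interval.
            continue
        rates[op] = delta / interval_seconds
    return rates


# Hypothetical opcounters samples taken 60 seconds apart.
previous = {"insert": 1000, "query": 5000, "update": 2000, "delete": 100, "getmore": 300}
current = {"insert": 1600, "query": 8000, "update": 2300, "delete": 100, "getmore": 360}

rates = op_rates(previous, current, interval_seconds=60)
print(rates)
```

Alerting on a sudden change in these rates, rather than on the raw counters, is what makes increases and decreases in requests visible.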

Replication (ReplicaSet) units

Connections: Number of connections

Oplog size / Oplog window size: Operation log size and retention period

Replication Lag: Replication delay time

Read / Write tickets: Read/write waiting status

Lock (Read / Write): Lock occurrence status
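Two of the replication metrics above are simple time differences, which the sketch below illustrates. Replication lag is taken here as the gap between the primary's and each secondary's `optimeDate` from `replSetGetStatus`, and the oplog window as the gap between the oldest and newest oplog entries. The host names and timestamps are hypothetical, and monitoring tools may compute these values somewhat differently.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(members: list[dict]) -> dict[str, timedelta]:
    """Return each secondary's lag behind the primary, based on optimeDate."""
    primaries = [m for m in members if m["stateStr"] == "PRIMARY"]
    if not primaries:
        raise ValueError("replica set has no primary")
    primary_optime = primaries[0]["optimeDate"]
    return {
        m["name"]: primary_optime - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def oplog_window(first_entry: datetime, last_entry: datetime) -> timedelta:
    """Return the time span covered by the oplog."""
    return last_entry - first_entry


# Hypothetical values for illustration.
now = datetime(2025, 8, 5, 12, 0, 0, tzinfo=timezone.utc)
members = [
    {"name": "db1.example.internal:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "db2.example.internal:27017", "stateStr": "SECONDARY",
     "optimeDate": now - timedelta(seconds=2)},
    {"name": "db3.example.internal:27017", "stateStr": "SECONDARY",
     "optimeDate": now - timedelta(seconds=45)},
]

lag = replication_lag(members)
window = oplog_window(now - timedelta(hours=30), now)

print({name: value.total_seconds() for name, value in lag.items()})
print(window < timedelta(hours=1))
```

If a secondary's lag approaches the oplog window, it risks falling off the oplog and needing a full resync, which is why the two metrics are worth watching together.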

Conclusion

To ensure stable operation of MongoDB, it is extremely important to keep up with versions, maintain healthy backups, and set appropriate alerts.

In particular, version management and monitoring can prevent many problems before they occur.

Furthermore, combining Atlas or Cloud Manager with external tools like Datadog lets you monitor more detailed metrics and resolve problems quickly, resulting in more efficient and sophisticated operations.

We also hope to cover some useful third-party tools, such as Datadog Database Monitoring for MongoDB and Percona Monitoring and Management, on another occasion.

Database Monitoring

Learn about Database Monitoring and get started

https://docs.datadoghq.com/database_monitoring/

Percona Monitoring and Management

Info: This is the documentation for the latest PMM 3 release. For details, see the PMM 3.3.1 release notes.

https://docs.percona.com/percona-monitoring-and-management/3/index.html

If you are interested in SRG, please contact us here.

Recruitment Information - CyberAgent SRG #ca_srg

About SRG SRG (Service Reliability Group) is working to improve reliability by promoting the introduction of SRE to the media business as a cross-sectional SRE, based on the vision of "improving reliability across the media business." The work is centered around the following three areas: Consolidating and deploying the technical know-how of each business

https://ca-srg.dev/careers