Prevent response degradation before it happens! Redis/Valkey monitoring basics and important metrics
This is Kobayashi (@berlinbytes) from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article explains the monitoring concepts essential for stable operation of Redis/Valkey, and introduces specific metrics and practical steps to prevent response degradation.
Introduction
At our company, we use Redis and its open source fork, Valkey, as extremely fast in-memory data stores for a wide range of purposes, including caching, session stores, real-time analytics, and primary databases.
Their distinctive feature is extremely low latency, achieved by manipulating data directly in memory.
However, precisely because they are so fast, even a slight performance degradation can affect overall system responsiveness and degrade the user experience.
In this article, I will summarize the monitoring concepts essential for stable operation of Redis and Valkey, and introduce specific metrics and practical steps to prevent response degradation.
The Five Key Layers of Monitoring
To build an effective monitoring system, it is useful to think of the system as being divided into multiple layers.
Here we will explain the five monitoring layers you should keep in mind.
1. Performance (Latency & Throughput)
Performance is the metric that most directly impacts user experience.
It is important to measure both the round trip latency from the application's perspective and the delay in processing commands within the server.
Monitor commands processed per second along with cache hit rates to get a complete picture of performance.
Redis/Valkey typically respond in under a millisecond, so even a small increase in latency signals a significant change.
In addition, since Redis/Valkey are essentially single-threaded, CPU usage is an important indicator of server performance.
On the server side, the LATENCY HISTORY command lets you review the history of recorded latency spikes per event (the latency monitor must be enabled by setting latency-monitor-threshold to a non-zero value).
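As a concrete illustration, the snippet below measures the client-side round trip with a PING, reads the server's commands-per-second counter, and pulls the recorded latency events. It is a minimal sketch assuming the redis-py client and a local instance on the default port; the host, port, and monitored event name are placeholders.

```python
import time
import redis  # redis-py client

# Placeholder connection details; adjust to your environment.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Client-side view: round-trip time of a lightweight PING from the application.
start = time.perf_counter()
r.ping()
print(f"round-trip latency: {(time.perf_counter() - start) * 1000:.2f} ms")

# Server-side view: throughput from INFO stats.
print("ops/sec:", r.info("stats")["instantaneous_ops_per_sec"])

# LATENCY HISTORY returns data only when latency-monitor-threshold is non-zero.
print(r.execute_command("LATENCY", "HISTORY", "command"))
```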
2. Memory Health
For in-memory databases like Redis/Valkey, memory is the most important resource.
Monitor memory usage relative to the configured maxmemory limit, the number of keys removed by eviction (evicted_keys), and the memory fragmentation ratio (mem_fragmentation_ratio).
A fragmentation ratio consistently above 1.5 is a sign of excessive fragmentation.
Conversely, if the value stays below 1.0, the OS may be swapping, and caution is advised.
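As a minimal sketch (assuming redis-py and placeholder connection details), these fields can be read directly from INFO memory and INFO stats:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

mem = r.info("memory")
stats = r.info("stats")

used = mem["used_memory"]
limit = mem.get("maxmemory", 0)  # 0 means no maxmemory limit is configured
if limit:
    print(f"memory usage: {used / limit:.1%} of maxmemory")

# > 1.5 suggests excessive fragmentation, < 1.0 may indicate OS swapping.
print("fragmentation ratio:", mem.get("mem_fragmentation_ratio"))
print("evicted keys (cumulative):", stats.get("evicted_keys"))
```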
3. Replication and Clusters
If you use replication or cluster configuration to increase availability, monitoring its health is also essential.
Monitor the data synchronization delay (replication lag) between the primary and its replicas, the difference in replication offsets, the connection status (link status), and so on.
Managed services such as AWS ElastiCache and Google Cloud Memorystore often provide these metrics as standard.
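When checking directly on a self-managed instance, the relevant fields appear in INFO replication on a replica; the sketch below (redis-py, with a placeholder replica host) checks the link status and how long it has been since the last I/O from the primary.

```python
import redis

# "replica-host" is a placeholder; point this at one of your replicas.
replica = redis.Redis(host="replica-host", port=6379, decode_responses=True)

repl = replica.info("replication")
if repl.get("role") == "slave":  # replicas are reported as "slave" in INFO replication
    print("link status:", repl.get("master_link_status"))  # expect "up"
    print("seconds since last I/O from primary:", repl.get("master_last_io_seconds_ago"))
    print("replica replication offset:", repl.get("slave_repl_offset"))
```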
4. The effects of persistence and forking
When using snapshots (RDB) or append-only files (AOF) for data persistence, background save operations can affect performance.
Both are performed by a child process created with fork(), and on instances with large datasets the fork itself can momentarily stall the main thread.
Recording whether a background save is in progress and how long the last save took makes it much easier to isolate performance problems when they occur.
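These values are exposed in INFO persistence and INFO stats; here is a minimal sketch using redis-py with placeholder connection details:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

p = r.info("persistence")
print("background save in progress:", p.get("rdb_bgsave_in_progress"))  # 1 while a BGSAVE runs
print("last BGSAVE duration (s):", p.get("rdb_last_bgsave_time_sec"))
print("last BGSAVE status:", p.get("rdb_last_bgsave_status"))

# Duration of the most recent fork(), in microseconds.
print("latest fork duration (us):", r.info("stats").get("latest_fork_usec"))
```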
5. Client Connections
The connection status from the client is also an important subject to monitor.
Monitor the number of currently connected clients, the number of clients waiting due to a blocking command, and the number of refused connections.
If connections are being refused, the configured maximum number of clients (maxclients) may have been reached; review your resources and check the connection pooling settings on the application side.
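The relevant counters live in INFO clients and INFO stats, and the configured limit can be read with CONFIG GET maxclients; a minimal redis-py sketch with placeholder connection details:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

clients = r.info("clients")
stats = r.info("stats")
maxclients = int(r.config_get("maxclients")["maxclients"])

print(f"connected clients: {clients['connected_clients']} / {maxclients}")
print("blocked clients:", clients["blocked_clients"])
print("rejected connections (cumulative):", stats.get("rejected_connections"))
```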
Top metrics and alerts to monitor
To get started with monitoring, we will introduce some core metrics that are particularly important and examples of setting up alerts to detect abnormalities.
Core Metrics List
- Memory: usage against the maxmemory limit, mem_fragmentation_ratio, evicted_keys, and the active_defrag_* statistics.
- CPU usage: the main thread's CPU usage; because Redis/Valkey are essentially single-threaded, watch this closely under heavy load.
- Activity: connected_clients.
- Replication lag.
- Latency: the delay in processing a command, measured in conjunction with the response time from the application.
- Cache efficiency: the cache hit rate, which indicates how effectively the cache is working.
- Network: the round-trip time (RTT) between the client and the instance.
A minimal script that collects these fields in one pass is sketched below.
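This sketch assumes the redis-py client and placeholder connection details; collect_core_metrics is an illustrative helper name, not part of any library.

```python
import redis

def collect_core_metrics(r: redis.Redis) -> dict:
    """Illustrative helper: gather the core metrics listed above in one pass."""
    info = r.info()  # default INFO sections parsed into a dict
    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    return {
        "used_memory": info.get("used_memory"),
        "maxmemory": info.get("maxmemory"),
        "mem_fragmentation_ratio": info.get("mem_fragmentation_ratio"),
        "evicted_keys": info.get("evicted_keys"),
        "connected_clients": info.get("connected_clients"),
        "blocked_clients": info.get("blocked_clients"),
        "instantaneous_ops_per_sec": info.get("instantaneous_ops_per_sec"),
        "hit_rate": hits / (hits + misses) if (hits + misses) else None,
    }

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
print(collect_core_metrics(r))
```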
Practical alert configuration examples
- Latency: when latency events recorded by the server's LATENCY monitor, or response times measured by the application, exceed the baseline.
- Evictions: when evicted_keys starts increasing rapidly, indicating memory pressure.
- Memory pressure: when memory usage exceeds 80% and fragmentation remains high.
- Replication lag: when replication lag continues beyond a predefined tolerance (e.g. 1-3 seconds).
- Blocked clients: when blocked_clients remains above zero for an extended period.
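In production these conditions are usually expressed as rules in your monitoring system, but as a sketch of the underlying logic, the checks could look like this (redis-py, illustrative thresholds, placeholder connection details):

```python
import redis

# Illustrative thresholds; tune them against your own baseline.
MEMORY_USAGE_LIMIT = 0.80
FRAGMENTATION_LIMIT = 1.5

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
info = r.info()

alerts = []
if info.get("maxmemory"):
    usage = info["used_memory"] / info["maxmemory"]
    if usage > MEMORY_USAGE_LIMIT and info.get("mem_fragmentation_ratio", 1.0) > FRAGMENTATION_LIMIT:
        alerts.append(f"memory pressure: {usage:.0%} used, fragmentation {info['mem_fragmentation_ratio']}")
if info.get("blocked_clients", 0) > 0:
    alerts.append(f"{info['blocked_clients']} clients are blocked")
if info.get("rejected_connections", 0) > 0:
    alerts.append("connections have been rejected (check maxclients)")

for message in alerts:
    print("ALERT:", message)
```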
How to create an effective monitoring dashboard
Rather than simply collecting metrics, creating dashboards that provide an at-a-glance understanding of the situation will greatly improve operational efficiency.
Below is an example of a dashboard configuration.
- Latency panel: displays round-trip times measured with lightweight commands such as PING and representative commands such as SET, so that spikes stand out immediately.
- Memory panel: displays memory usage against the maxmemory limit, the fragmentation ratio, and the trend of evicted_keys.
- Throughput panel: Displays the number of commands processed per second (OPS), network traffic, and the amount of data transferred via replication, allowing you to understand load trends.
- Client Panel: Monitor trends such as number of connected clients, number of blocked clients, number of rejected connections, and number of authentication failures.
- Cache Efficiency Panel: Displays the cache hit rate in a separate large panel, allowing you to detect any sudden drops early.
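For reference, the hit rate shown in the cache efficiency panel can be derived from the keyspace_hits and keyspace_misses counters in INFO stats; a minimal sketch with redis-py and placeholder connection details:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
stats = r.info("stats")

hits = stats.get("keyspace_hits", 0)
misses = stats.get("keyspace_misses", 0)
total = hits + misses

# Hit rate = hits / (hits + misses); undefined until lookups have occurred.
print(f"cache hit rate: {hits / total:.1%}" if total else "no lookups recorded yet")
```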
Action plan for implementing monitoring
Finally, here is a concrete action plan for getting started with monitoring.
Establishing a Baseline
First, understand the normal behavior of the system.
Record key metrics such as latency, memory usage, OPS, and number of connections over a period of 1-2 weeks to establish trends (baselines) by time of day and day of the week.
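If you do not yet have a monitoring stack that retains history, a baseline can be recorded with something as simple as the loop below (redis-py; the file name, sampling interval, and chosen fields are illustrative):

```python
import csv
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Append one sample per minute; let it run for 1-2 weeks to build a baseline.
with open("redis_baseline.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        info = r.info()
        writer.writerow([
            int(time.time()),
            info.get("used_memory"),
            info.get("instantaneous_ops_per_sec"),
            info.get("connected_clients"),
        ])
        f.flush()
        time.sleep(60)
```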
Setting up an initial alert
Use baselines to set alerts with realistic thresholds.
For example, start with a latency threshold of "baseline + 5 ms" and a memory usage threshold of 80% or 90%, and have connection refusals and blocked clients trigger a notification immediately when they occur.
Establishing an analysis cycle
Establish a process for investigating the root cause whenever an alert fires.
Built-in diagnostic commands such as LATENCY DOCTOR, which analyzes recorded latency events and suggests likely causes, are a useful starting point.
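The report can also be pulled from a script; this minimal sketch (redis-py, placeholder connection details) prints the LATENCY DOCTOR analysis and, as an additional aid, the most recent slow log entries.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Human-readable analysis of recorded latency events
# (requires latency-monitor-threshold to be set to a non-zero value).
print(r.execute_command("LATENCY", "DOCTOR"))

# The slow log is another quick way to spot expensive commands.
for entry in r.slowlog_get(10):
    print(entry)
```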
Continuous threshold adjustment
As operations continue and the service grows, the normal ranges for metrics such as the cache hit rate and fragmentation ratio are likely to shift.
Based on this data, it is necessary to continually review and improve alert thresholds to reduce false positives and detect anomalies earlier.
Conclusion
I believe that building a proactive monitoring system through the steps we have introduced will lead to stable operation of services that utilize Redis/Valkey and provide an excellent user experience.
Finally, for configuration and operational details, also refer to the official best-practice documentation for the various cloud services.
SRG is looking for people to work with us.
If you are interested, please contact us here.