Prevent response degradation before it happens! Redis/Valkey monitoring basics and important metrics
This is Kobayashi (@berlinbytes) from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article explains the monitoring concepts essential for stable operation of Redis/Valkey, and introduces specific metrics and practical steps to prevent response degradation.
Introduction
At our company, we use Redis and its open source fork, Valkey, as extremely fast in-memory data stores for a wide range of purposes, including caching, session stores, real-time analytics, and primary databases.
Their distinctive feature is extremely low latency, achieved by manipulating data directly in memory.
However, precisely because they are so fast, even a slight performance degradation can affect overall system responsiveness and degrade the user experience.
In this article, I will summarize the monitoring concepts essential for stable operation of Redis and Valkey, and introduce specific metrics and practical steps to prevent response degradation.
The Five Key Layers of Monitoring
To build an effective monitoring system, it is useful to think of the system as being divided into multiple layers.
Here we will explain the five monitoring layers you should keep in mind.
1. Performance (Latency & Throughput)
Performance is the metric that most directly impacts user experience.
It is important to measure both the round trip latency from the application's perspective and the delay in processing commands within the server.
Monitor commands processed per second along with cache hit rates to get a complete picture of performance.
Redis/Valkey typically respond in under a millisecond, so even a small increase in latency signals a significant change.
In addition, since Redis/Valkey are essentially single-threaded, CPU usage is an important indicator of server performance.
On the server side, the LATENCY HISTORY command lets you review the history of recorded latency spikes per event (the latency monitor must be enabled by setting latency-monitor-threshold to a non-zero value).
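As a concrete illustration, the snippet below measures the client-side round trip with a PING, reads the server's commands-per-second counter, and pulls the recorded latency events. It is a minimal sketch assuming the redis-py client and a local instance on the default port; the host, port, and monitored event name are placeholders.

```python
import time
import redis  # redis-py client

# Placeholder connection details; adjust to your environment.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Client-side view: round-trip time of a lightweight PING from the application.
start = time.perf_counter()
r.ping()
print(f"round-trip latency: {(time.perf_counter() - start) * 1000:.2f} ms")

# Server-side view: throughput from INFO stats.
print("ops/sec:", r.info("stats")["instantaneous_ops_per_sec"])

# LATENCY HISTORY returns data only when latency-monitor-threshold is non-zero.
print(r.execute_command("LATENCY", "HISTORY", "command"))
```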
2. Memory Health
For in-memory databases like Redis/Valkey, memory is the most important resource.
Monitor memory usage relative to the configured maxmemory limit, the number of keys removed by eviction (evicted_keys), and the memory fragmentation ratio (mem_fragmentation_ratio).
A fragmentation ratio consistently above 1.5 is a sign of excessive fragmentation.
Conversely, if the value stays below 1.0, the OS may be swapping, and caution is advised.
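As a minimal sketch (assuming redis-py and placeholder connection details), these fields can be read directly from INFO memory and INFO stats:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

mem = r.info("memory")
stats = r.info("stats")

used = mem["used_memory"]
limit = mem.get("maxmemory", 0)  # 0 means no maxmemory limit is configured
if limit:
    print(f"memory usage: {used / limit:.1%} of maxmemory")

# > 1.5 suggests excessive fragmentation, < 1.0 may indicate OS swapping.
print("fragmentation ratio:", mem.get("mem_fragmentation_ratio"))
print("evicted keys (cumulative):", stats.get("evicted_keys"))
```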
3. Replication and Clusters
If you use replication or cluster configuration to increase availability, monitoring its health is also essential.
Monitor the data synchronization delay (replication lag) between the primary and its replicas, the difference in replication offsets, the connection status (link status), and so on.
Managed services such as AWS ElastiCache and Google Cloud Memorystore often provide these metrics as standard.
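When checking directly on a self-managed instance, the relevant fields appear in INFO replication on a replica; the sketch below (redis-py, with a placeholder replica host) checks the link status and how long it has been since the last I/O from the primary.

```python
import redis

# "replica-host" is a placeholder; point this at one of your replicas.
replica = redis.Redis(host="replica-host", port=6379, decode_responses=True)

repl = replica.info("replication")
if repl.get("role") == "slave":  # replicas are reported as "slave" in INFO replication
    print("link status:", repl.get("master_link_status"))  # expect "up"
    print("seconds since last I/O from primary:", repl.get("master_last_io_seconds_ago"))
    print("replica replication offset:", repl.get("slave_repl_offset"))
```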
4. The effects of persistence and forking
When using snapshots (RDB) or append-only files (AOF) for data persistence, background save operations can affect performance.
Both are performed by a child process created with fork(), and on instances with large datasets the fork itself can momentarily stall the main thread.
Recording whether a background save is in progress and how long the last save took makes it much easier to isolate performance problems when they occur.
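These values are exposed in INFO persistence and INFO stats; here is a minimal sketch using redis-py with placeholder connection details:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

p = r.info("persistence")
print("background save in progress:", p.get("rdb_bgsave_in_progress"))  # 1 while a BGSAVE runs
print("last BGSAVE duration (s):", p.get("rdb_last_bgsave_time_sec"))
print("last BGSAVE status:", p.get("rdb_last_bgsave_status"))

# Duration of the most recent fork(), in microseconds.
print("latest fork duration (us):", r.info("stats").get("latest_fork_usec"))
```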
5. Client Connections
The connection status from the client is also an important subject to monitor.
Monitor the number of currently connected clients, the number of clients waiting due to a blocking command, and the number of refused connections.
If connections are being refused, the configured maximum number of clients (maxclients) may have been reached; review your resources and check the connection pooling settings on the application side.
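The relevant counters live in INFO clients and INFO stats, and the configured limit can be read with CONFIG GET maxclients; a minimal redis-py sketch with placeholder connection details:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

clients = r.info("clients")
stats = r.info("stats")
maxclients = int(r.config_get("maxclients")["maxclients"])

print(f"connected clients: {clients['connected_clients']} / {maxclients}")
print("blocked clients:", clients["blocked_clients"])
print("rejected connections (cumulative):", stats.get("rejected_connections"))
```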
Top metrics and alerts to monitor
To get started with monitoring, we will introduce some core metrics that are particularly important and examples of setting up alerts to detect abnormalities.
Core Metrics List
- Memory: usage against the maxmemory limit, mem_fragmentation_ratio, evicted_keys, and the active_defrag_* statistics.
- CPU usage: the main thread's CPU usage; because Redis/Valkey are essentially single-threaded, watch this closely under heavy load.
- Activity: connected_clients.
- Replication lag.
- Latency: the delay in processing a command, measured in conjunction with the response time from the application.
- Cache efficiency: the cache hit rate, which indicates how effectively the cache is working.
- Network: the round-trip time (RTT) between the client and the instance.
A minimal script that collects these fields in one pass is sketched below.
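This sketch assumes the redis-py client and placeholder connection details; collect_core_metrics is an illustrative helper name, not part of any library.

```python
import redis

def collect_core_metrics(r: redis.Redis) -> dict:
    """Illustrative helper: gather the core metrics listed above in one pass."""
    info = r.info()  # default INFO sections parsed into a dict
    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    return {
        "used_memory": info.get("used_memory"),
        "maxmemory": info.get("maxmemory"),
        "mem_fragmentation_ratio": info.get("mem_fragmentation_ratio"),
        "evicted_keys": info.get("evicted_keys"),
        "connected_clients": info.get("connected_clients"),
        "blocked_clients": info.get("blocked_clients"),
        "instantaneous_ops_per_sec": info.get("instantaneous_ops_per_sec"),
        "hit_rate": hits / (hits + misses) if (hits + misses) else None,
    }

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
print(collect_core_metrics(r))
```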
Practical alert configuration examples
- Latency: when latency events recorded by the server's LATENCY monitor, or response times measured by the application, exceed the baseline.
- Evictions: when evicted_keys starts increasing rapidly, indicating memory pressure.
- Memory pressure: when memory usage exceeds 80% and fragmentation remains high.
- Replication lag: when replication lag continues beyond a predefined tolerance (e.g. 1-3 seconds).
- Blocked clients: when blocked_clients remains above zero for an extended period.
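In production these conditions are usually expressed as rules in your monitoring system, but as a sketch of the underlying logic, the checks could look like this (redis-py, illustrative thresholds, placeholder connection details):

```python
import redis

# Illustrative thresholds; tune them against your own baseline.
MEMORY_USAGE_LIMIT = 0.80
FRAGMENTATION_LIMIT = 1.5

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
info = r.info()

alerts = []
if info.get("maxmemory"):
    usage = info["used_memory"] / info["maxmemory"]
    if usage > MEMORY_USAGE_LIMIT and info.get("mem_fragmentation_ratio", 1.0) > FRAGMENTATION_LIMIT:
        alerts.append(f"memory pressure: {usage:.0%} used, fragmentation {info['mem_fragmentation_ratio']}")
if info.get("blocked_clients", 0) > 0:
    alerts.append(f"{info['blocked_clients']} clients are blocked")
if info.get("rejected_connections", 0) > 0:
    alerts.append("connections have been rejected (check maxclients)")

for message in alerts:
    print("ALERT:", message)
```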
How to create an effective monitoring dashboard
Rather than simply collecting metrics, creating dashboards that provide an at-a-glance understanding of the situation will greatly improve operational efficiency.
Below is an example of a dashboard configuration.
- Latency panel: displays round-trip times measured with lightweight commands such as PING and representative commands such as SET, so that spikes stand out immediately.
- Memory panel: displays memory usage against the maxmemory limit, the fragmentation ratio, and the trend of evicted_keys.
- Throughput panel: Displays the number of commands processed per second (OPS), network traffic, and the amount of data transferred via replication, allowing you to understand load trends.
- Client Panel: Monitor trends such as number of connected clients, number of blocked clients, number of rejected connections, and number of authentication failures.
- Cache Efficiency Panel: Displays the cache hit rate in a separate large panel, allowing you to detect any sudden drops early.
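For reference, the hit rate shown in the cache efficiency panel can be derived from the keyspace_hits and keyspace_misses counters in INFO stats; a minimal sketch with redis-py and placeholder connection details:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
stats = r.info("stats")

hits = stats.get("keyspace_hits", 0)
misses = stats.get("keyspace_misses", 0)
total = hits + misses

# Hit rate = hits / (hits + misses); undefined until lookups have occurred.
print(f"cache hit rate: {hits / total:.1%}" if total else "no lookups recorded yet")
```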
Action plan for implementing monitoring
Finally, here is a concrete action plan for getting started with monitoring.
Establishing a Baseline
First, understand the normal behavior of the system.
Record key metrics such as latency, memory usage, OPS, and number of connections over a period of 1-2 weeks to establish trends (baselines) by time of day and day of the week.
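If you do not yet have a monitoring stack that retains history, a baseline can be recorded with something as simple as the loop below (redis-py; the file name, sampling interval, and chosen fields are illustrative):

```python
import csv
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Append one sample per minute; let it run for 1-2 weeks to build a baseline.
with open("redis_baseline.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        info = r.info()
        writer.writerow([
            int(time.time()),
            info.get("used_memory"),
            info.get("instantaneous_ops_per_sec"),
            info.get("connected_clients"),
        ])
        f.flush()
        time.sleep(60)
```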
Setting up an initial alert
Use baselines to set alerts with realistic thresholds.
For example, start with a latency threshold of "baseline + 5 ms" and a memory usage threshold of 80% or 90%, and have connection refusals and blocked clients trigger a notification immediately when they occur.
Establishing an analysis cycle
Establish a process for investigating the root cause whenever an alert fires.
Built-in diagnostic commands such as LATENCY DOCTOR, which analyzes recorded latency events and suggests likely causes, are a useful starting point.
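The report can also be pulled from a script; this minimal sketch (redis-py, placeholder connection details) prints the LATENCY DOCTOR analysis and, as an additional aid, the most recent slow log entries.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Human-readable analysis of recorded latency events
# (requires latency-monitor-threshold to be set to a non-zero value).
print(r.execute_command("LATENCY", "DOCTOR"))

# The slow log is another quick way to spot expensive commands.
for entry in r.slowlog_get(10):
    print(entry)
```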
Continuous threshold adjustment
As operations continue and the service grows, the normal ranges for metrics such as the cache hit rate and fragmentation ratio are likely to shift.
Based on this data, it is necessary to continually review and improve alert thresholds to reduce false positives and detect anomalies earlier.
Conclusion
I believe that building a proactive monitoring system through the steps we have introduced will lead to stable operation of services that utilize Redis/Valkey and provide an excellent user experience.
Finally, for configuration and operational details, also refer to the official best-practice documentation for the various cloud services.
SRG is looking for people to work with us.
If you are interested, please contact us here.