Essential pre-launch tasks for Spanner-based services: Pre-splitting to prepare for sudden traffic spikes.
This is Masaya Matsuda (@mm_matsuda816) from the Service Reliability Group (SRG) of the Media Division.
The Service Reliability Group primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article is the day 16 entry in the CyberAgent Group SRE Advent Calendar 2025.
This article introduces Spanner's "Pre-splitting" feature, which prevents latency degradation immediately after launch. We provide a detailed explanation, from design guidelines to configuration methods, and summarize key points for smoothly launching services even during traffic surges.
- The phenomenon of Spanner "getting stuck" immediately after launch
- Pre-splitting (split points) mechanism and design
  - How many split points to create
  - Where to split
- Creating and verifying Split Points
  - 1. Creating Split Points
  - 2. Verifying the number of splits
- Points to note regarding operation
  - Impact on latency
  - Monitoring and expiration
- Pre-launch checklist
The phenomenon of Spanner "getting stuck" immediately after launch.
Google Cloud Spanner is a powerful database that combines the consistency of relational databases with the horizontal scalability of NoSQL.
However, immediately after a large-scale service launch or campaign start, "bottlenecks" can occur where expected performance is not achieved and latency worsens.
Many of these problems stem from the behavior of Split, Spanner's data distribution mechanism.
Spanner normally monitors data size and load, automatically splits the data, and distributes the resulting splits across multiple nodes.
This article calls that behavior automatic splitting. There is a time lag between Spanner detecting the load and completing the split.
In other words, in scenarios where traffic suddenly jumps from "0 to 100" immediately after launch, automatic splitting cannot keep up, and access becomes concentrated on specific nodes, turning them into hotspots.
The feature introduced to address this issue is pre-splitting (split points), which became generally available on April 28, 2025.
Pre-splitting (split points) mechanism and design
Pre-splitting is a feature that manually creates split points in advance before traffic arrives, distributing data across multiple nodes.
This makes it possible to handle spikes in access immediately after launch with multiple nodes from the start.
How many split points to create
As a guideline, the official documentation recommends 10 split points per node.
For example, when using a 5-node instance in a production environment, it is recommended to create approximately 5 × 10 = 50 split points.
For smaller instances, automatic splitting may be sufficient, so pre-splitting is not always necessary. It should be used when large-scale traffic is expected.
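The guideline above is simple arithmetic, and a minimal sketch makes it easy to reuse in a launch script. The function name and constant below are hypothetical helpers introduced only for illustration. The factor of 10 is the guideline quoted in this article, not a hard limit.

```python
# Guideline quoted in this article: about 10 split points per node.
SPLIT_POINTS_PER_NODE = 10


def recommended_split_points(node_count: int) -> int:
    """Return the guideline number of split points for an instance."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_count * SPLIT_POINTS_PER_NODE


if __name__ == "__main__":
    # The 5-node example from the text.
    print(recommended_split_points(5))  # 50
```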
Where to split
The location of the split (the range of keys) depends on the key design of the data.
- When keys are evenly distributed (such as UUIDs or hash values)
The entire key space will be divided evenly.
- When a specific range becomes hot
If you know that there is a concentration of access to specific user IDs or categories, focus on segmenting that area.
Furthermore, it is recommended to create split points not only for tables but also for indexes that experience high traffic.
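For the evenly distributed case, the split points can be computed instead of being written by hand. The sketch below assumes the key is either an integer in a known range or a UUID stored as a lowercase canonical string, so that string order matches numeric order. Both function names are hypothetical helpers, not part of any Spanner API.

```python
import uuid


def even_split_points(key_min: int, key_max: int, count: int) -> list[int]:
    """Return `count` boundaries cutting [key_min, key_max) into count + 1 near-equal ranges."""
    span = key_max - key_min
    if count < 1 or span <= count:
        raise ValueError("need count >= 1 and a key range wider than count")
    return [key_min + span * i // (count + 1) for i in range(1, count + 1)]


def even_uuid_split_points(count: int) -> list[str]:
    """Evenly spaced boundaries over the full 128-bit UUID space, as canonical strings."""
    return [str(uuid.UUID(int=value)) for value in even_split_points(0, 2**128, count)]


if __name__ == "__main__":
    print(even_split_points(0, 1000, 4))  # [200, 400, 600, 800]
    print(even_uuid_split_points(3))
```

For a hot range, the same helper can be called with `key_min` and `key_max` narrowed to that range, so that more boundaries land where access is concentrated.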
Creating and verifying Split Points
This section explains the steps to actually create split points and verify that they have been applied correctly.
Using the Google Cloud CLI (gcloud) is the most common method.
1. Creating Split Points
Use the gcloud spanner databases splits add command. You can either prepare a file describing the split points or specify them via the command line.
The following is an example of the command.
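As a minimal sketch, the Python snippet below builds the lines of a splits file and the matching command line. Several details are assumptions that should be checked against the official documentation: the one-entry-per-line file format `TABLE <name> (<key>)`, the `--instance` and `--splits-file` flags, and the timestamp format. The database, instance, and table names are placeholders. Only the subcommand and `--split-expiration-date` appear in this article.

```python
import shlex
from typing import Optional


def splits_file_lines(table: str, keys: list[int]) -> list[str]:
    """Format one split point per line (assumed format: TABLE <name> (<key>))."""
    return [f"TABLE {table} ({key})" for key in keys]


def splits_add_command(database: str, instance: str, splits_file: str,
                       expiration: Optional[str] = None) -> str:
    """Build the gcloud command line as a shell-safe string (flag names are assumptions)."""
    args = [
        "gcloud", "spanner", "databases", "splits", "add", database,
        f"--instance={instance}",
        f"--splits-file={splits_file}",
    ]
    if expiration is not None:
        args.append(f"--split-expiration-date={expiration}")
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder table name and keys.
    print("\n".join(splits_file_lines("Users", [200, 400, 600, 800])))
    print(splits_add_command("example-db", "example-instance", "splits.txt",
                             "2025-12-26T00:00:00Z"))
```

Building the command as a string keeps the sketch free of side effects. It can be reviewed and then run by hand once the flags are confirmed.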
The --split-expiration-date option sets when the split points expire. Split points are not permanent and should be reverted to Spanner's automated management once traffic stabilizes.
The default expiration is 10 days, and it can be extended up to a maximum of 30 days.
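A small helper can compute the expiration timestamp and reject values beyond the 30-day maximum before the command is run. This is a sketch under assumptions: the function name is hypothetical, and the UTC timestamp format produced here is assumed to be acceptable to --split-expiration-date.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_EXPIRATION = timedelta(days=10)  # default stated in this article
MAX_EXPIRATION = timedelta(days=30)      # maximum stated in this article


def split_expiration(created_at: datetime,
                     keep_for: timedelta = DEFAULT_EXPIRATION) -> str:
    """Return the expiration time as a UTC timestamp string."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    if keep_for <= timedelta(0) or keep_for > MAX_EXPIRATION:
        raise ValueError("keep_for must be positive and at most 30 days")
    expires = (created_at + keep_for).astimezone(timezone.utc)
    return expires.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    created = datetime(2025, 12, 16, tzinfo=timezone.utc)
    print(split_expiration(created))  # 2025-12-26T00:00:00Z
```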
2. Verifying the number of splits
Verify that the created split points have been correctly reflected, for example by listing them and counting the entries with a tool such as jq.
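As a sketch of this check, the snippet below counts the entries in a JSON array of split points. It assumes the listing was produced with gcloud's general --format=json option on a listing command such as gcloud spanner databases splits list (the exact subcommand is an assumption), and it deliberately relies on no field names inside each entry. The sample input is made up for illustration.

```python
import json


def count_split_points(listing_json: str) -> int:
    """Count the entries in a JSON array of split points."""
    entries = json.loads(listing_json)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of split points")
    return len(entries)


if __name__ == "__main__":
    # Made-up sample; real output would come from the listing command.
    sample = "[{}, {}, {}]"
    print(count_split_points(sample))  # 3
```

Comparing this count with the number of lines in the splits file is a quick way to confirm that every split point was accepted.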
Points to note regarding operation
Impact on latency
More split points are not necessarily better.
If the data is split too finely, the number of nodes a single transaction spans (transaction participants) increases, which can persistently worsen read/write latency.
Additionally, the resource usage (Compute/Query usage) for queries may also increase.
Monitoring and expiration
After launch, please monitor the "Latency Profile" and "Key Visualizer" in the Cloud Console.
If excessive splitting is hurting performance (for example, overall latency has increased rather than just hotspots appearing), you may need to manually expire the split points before the set expiration date and revert to Spanner's automatic management.
Once the expiration date passes, the split points are no longer displayed, and the splits may be automatically merged depending on traffic conditions.
This is normal behavior and means that Spanner has returned to automatic splitting.
Pre-launch checklist
Finally, here's a summary of things to check before launching a service that uses Spanner.
- Securing the number of nodes
Pre-splitting is not a function that increases capacity.
First, as a prerequisite, ensure that you have provisioned a sufficient number of nodes to handle the expected traffic.
- Is the number of Split Points appropriate?
Check that the number of split points is approximately "number of nodes × 10".
- Is the timing of the pre-splitting appropriate?
The official recommendation is "7 days to 12 hours before launch."
Creating split points too early risks having them expire before launch, while creating them too close to launch may not leave enough time for them to take effect.
- Check the restrictions
Search Index and Vector Index are not subject to pre-splitting.
For workloads that depend on these, additional load balancing measures may be necessary.
- Monitoring during launch
Monitor Key Visualizer for hotspots or latency issues caused by excessive splitting, and be prepared to expire the split points if necessary.
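The timing item in the checklist can be expressed as a small check, sketched below. The window of 7 days to 12 hours before launch comes from this article. The function name and the example dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EARLIEST_LEAD = timedelta(days=7)   # no earlier than 7 days before launch
LATEST_LEAD = timedelta(hours=12)   # no later than 12 hours before launch


def in_presplit_window(create_at: datetime, launch_at: datetime) -> bool:
    """Return True if create_at is between 7 days and 12 hours before launch_at."""
    lead = launch_at - create_at
    return LATEST_LEAD <= lead <= EARLIEST_LEAD


if __name__ == "__main__":
    launch = datetime(2025, 12, 20, 0, 0, tzinfo=timezone.utc)
    two_days_before = launch - timedelta(days=2)
    six_hours_before = launch - timedelta(hours=6)
    print(in_presplit_window(two_days_before, launch))   # True
    print(in_presplit_window(six_hours_before, launch))  # False
```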
To maximize Spanner's performance and ensure a smooth launch, make effective use of pre-splitting.
