Providing canary releases with Argo Rollouts

My name is Taninari, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-functional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article describes how we used Argo Rollouts to provide canary releases. Compared with traditional all-at-once deployments, which risk impacting every user, canary releases allow features to be released safely and quickly through gradual traffic control and automated monitoring.
 

Why do we need canary releases?


Improving system productivity and reliability is an important issue in development. In particular, when releasing new features or fixes, there is always a risk that unexpected defects will affect the entire service.
One method for minimizing this risk and releasing safely and quickly is "canary release."
Canary release is a deployment method in which a new version of an application is first released to a limited number of users, and then gradually expanded to all users after verifying that there are no problems.
Introducing canary releases offers the following benefits:

Reduced release risk

Even if a problem occurs, the scope of the impact can be limited, allowing for quick recovery.

Quality verification in a production environment

You can validate the performance and stability of new features in a production environment with real user access.

Improved user experience

Preventing large-scale failures and providing a stable service increases user trust and satisfaction.

Ensuring psychological safety for developers

Even if a release contains bugs, the damage can be kept small, which reduces the psychological burden on developers when they release.

Research into how to achieve canary release


Although we simply say "canary release," there are many tools and methods for achieving it.
We looked for the method that best suited us, taking into account compatibility with our existing platform environment and implementation costs.

Comparing different tools

First, we looked at several tools that might enable canary releases.
  • Linkerd
    • It offers the same service mesh functionality as Istio, but we decided not to use it because we were already using Istio.
  • Consul Connect
    • This is also a type of service mesh, and while it is a powerful option for multi-cloud environments, it did not meet our current requirements.
  • Spinnaker
    • Although it is a highly functional CI/CD tool, we decided that its implementation would be too large-scale for the sole purpose of canary release.
  • Keptn
    • Although it can automate advanced operations, such as measuring SLOs (service level objectives), we decided that it would be too large-scale for the sole purpose of canary release.
These tools are very powerful, but adopting them involves considerable learning costs and operational overhead.
So we decided to make use of the technology stack already in place on the Platform.

Argo Rollouts Selection

In our environment, we are already using Istio as a service mesh and Argo CD as a CD tool.
With Istio's VirtualService and DestinationRule, traffic can already be split between versions by weight.
However, with Istio alone, it is difficult to achieve "progressive delivery," which involves monitoring metrics such as error rates and automatically advancing or rolling back rollouts based on the results.
That's when we turned our attention to "Argo Rollouts."
Argo Rollouts is a Kubernetes-native progressive delivery tool that you install as a separate component from Argo CD.
We decided that by introducing Argo Rollouts, we could integrate it with our existing Istio and enable safe, automated canary releases without compromising the developer experience.

Canary Release Practice with Argo Rollouts


From here, we will explain the steps we took to actually implement Argo Rollouts and test a canary release.

Installing Argo Rollouts

First, install the Argo Rollouts controller in your cluster.
Basically, you can install it by following the installation steps in the official documentation.

Basic Canary Release

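With Argo Rollouts, you use a Rollout resource in place of a Deployment.
The canary steps are defined under strategy in the Rollout, and the metric checks that decide whether to proceed are defined in an AnalysisTemplate.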
Rollout Resource Example
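The following is a minimal sketch of what such a Rollout could look like, not the exact manifest used in our environment. The application name, image, port, step weights, pause duration, and the AnalysisTemplate name error-rate-check are all hypothetical values chosen for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: sample-app                    # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: sample-app
          image: example.com/sample-app:v2   # hypothetical image
          ports:
            - containerPort: 8080            # hypothetical port
  strategy:
    canary:
      steps:
        - setWeight: 10               # send about 10% to the new version first
        - analysis:                   # run metric checks before going further
            templates:
              - templateName: error-rate-check   # hypothetical; must exist as an AnalysisTemplate
        - setWeight: 50
        - pause:
            duration: 5m              # wait before promoting to 100%
```

Once the last step finishes, the new version is promoted to receive all traffic.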
AnalysisTemplate example (when integrated with Datadog)
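As a rough sketch, an AnalysisTemplate that queries Datadog might look like the following. The template name, the metric name in the query, the service tag, and the thresholds are assumptions for illustration, and the example presumes that Datadog credentials have already been configured for the Argo Rollouts controller.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check              # hypothetical template name
spec:
  metrics:
    - name: error-rate
      interval: 1m                    # take a measurement every minute
      count: 5                        # stop after five measurements
      failureLimit: 1                 # more than one failed measurement fails the analysis
      # Treat a missing result as 0, and require the error rate to stay under 5%
      successCondition: default(result, 0) < 0.05
      provider:
        datadog:
          interval: 5m                # time window evaluated by the query
          # hypothetical metric name and service tag
          query: |
            avg:requests.error.rate{service:sample-app}
```

If the analysis fails, the rollout is aborted and traffic returns to the stable version.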
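With this configuration, when you deploy a new version, 10% of traffic will be directed to the new version first.
The analysis defined in the AnalysisTemplate then runs, and the rollout proceeds to the next step only if the analysis succeeds.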
However, this method has one problem when used on its own.
Without a traffic router, Argo Rollouts can only approximate the traffic split through the ratio of Pod replicas.
For example, if you have three Pods and try to direct 10% of your traffic to them, one new Pod will actually be launched, and approximately 33% of the traffic will be directed to it.
To achieve stricter traffic control, integration with Istio is required.

Advanced traffic control with Istio integration

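With Istio integration, the Rollout references a VirtualService and a DestinationRule, and Argo Rollouts controls the traffic split through them.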
Rollout resource with Istio integration configured
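A minimal sketch of a Rollout with Istio traffic routing is shown here, assuming subset-based splitting through a DestinationRule. The resource names (sample-app, sample-app-vsvc, sample-app-destrule), the route name primary, the subset names, and the step values are hypothetical; the referenced VirtualService and DestinationRule are assumed to exist in the same namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: sample-app                    # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: sample-app
          image: example.com/sample-app:v2   # hypothetical image
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: sample-app-vsvc     # hypothetical VirtualService name
            routes:
              - primary               # name of the HTTP route whose weights are managed
          destinationRule:
            name: sample-app-destrule # hypothetical DestinationRule name
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 10               # 10% of requests, regardless of Pod count
        - pause:
            duration: 5m
        - setWeight: 50
        - pause:
            duration: 5m
```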
Linked VirtualService and DestinationRule
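The linked Istio resources could be sketched as follows. The host, resource names, route name, and subset names are hypothetical, and the Rollout that manages them is assumed to refer to these names in its trafficRouting settings.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sample-app-vsvc               # hypothetical name
spec:
  hosts:
    - sample-app                      # hypothetical Kubernetes Service host
  http:
    - name: primary                   # route name referenced from the Rollout
      route:
        - destination:
            host: sample-app
            subset: stable
          weight: 100                 # initial value; updated during a rollout
        - destination:
            host: sample-app
            subset: canary
          weight: 0                   # initial value; updated during a rollout
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sample-app-destrule           # hypothetical name
spec:
  host: sample-app
  subsets:
    - name: canary                    # Pods of the new version
      labels:
        app: sample-app
    - name: stable                    # Pods of the current version
      labels:
        app: sample-app
```

Both subsets start with the same labels. During a rollout, Argo Rollouts adds the Pod template hash of each version to the subsets so that they select the correct Pods.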
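As the rollout progresses, Argo Rollouts automatically rewrites the weight values of the VirtualService routes.
This allows for precise percentage-based traffic control that is independent of the number of Pods.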

Argo Rollouts Operations Guide


We will introduce the basic operations required for actually running Argo Rollouts.

Working with Argo Rollouts

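Rollouts are inspected and operated with the Argo Rollouts kubectl plugin (kubectl argo rollouts).
You can also launch the dashboard on your local PC and check the status in the GUI.
From the dashboard you can also promote a rollout; Promote-Full skips the remaining steps and switches fully to the new version.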

Points to note

ReplicaSets from previous versions can remain in the cluster after a rollout.
This is unlikely to be a big problem, but if they are left as they are, unnecessary resources are retained, so you need to decide when to delete them during operation.

Summary


In this article, we investigated and implemented canary releases on the Platform with the aim of improving productivity and reliability.
After comparing various tools, we decided to adopt Argo Rollouts, which is highly compatible with Istio in our existing technology stack and enables progressive delivery.
We confirmed that combining Argo Rollouts with Istio provides precise percentage-based traffic control together with metrics-based analysis, which makes a safe and flexible release flow possible.
We will continue to make improvements and aim to provide a more stable service.
 
If you are interested in SRG, please contact us here.