The path to optimizing ArgoCD performance in an HA configuration
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and more.
This article introduces methods to solve the performance issues of ArgoCD in HA configuration. Specifically, it explains optimization techniques to deal with deployment delays caused by resource increases and how to implement load balancing using sharding.
I hope this helps in some way.
- ArgoCD is slow
- One day, you notice a delay in deployment
- Current situation survey
- Survey results
- What to do
- Factors affecting ArgoCD performance
- Key Components of ArgoCD
- Core Metrics
- Element 1: Number of RepoServers
- Element 2: Reconcile interval
- Element 3: Adjusting the Controller processor count
- Element 4: Kubernetes API Server request related
- Element 5: Using Helm/Kustomize and MonoRepo
- Element 6: Number of Application Controllers (number of shards)
- Trying out Sharding
- Comparison of sharding algorithms
- Optimal Sharding Solution
- Improvement results
- Before the improvements
- First step: enabling sharding
- Second step: switching to Round-robin
- Sharding experiments and applying the optimal solution
- Other tweaks
- UI Performance
- Unmanaged Resource Problems
- Conclusion
ArgoCD is slow
It has already been four years since we migrated the continuous delivery (CD) function of Product A to ArgoCD. During this period, the number of services migrated from legacy systems has continued to grow, and the resources managed in our CD environment have expanded rapidly. Although deployments were smooth at the beginning of the migration, challenges that have arisen over the past few years have made the need for optimization clear.
One day, you notice a delay in deployment.
One day, a backend developer reported that deployment was taking more than 30 minutes. When I checked the ArgoCD UI, I was shocked to find it was true, and the UI was showing the following error:

Context deadline exceeded
Current situation survey
Project/ApplicationSet
The Application source repositories are divided into three parts, depending on their purpose:
- ArgoCD Application Definition (Helm Chart)
- Application Manifest (Kustomize)
- Cluster Component CRD Manifest (Kustomize + Helm Chart)
There are approximately 250 applications and approximately 30,000 resources managed and tracked via ArgoCD.
The ArgoCD configuration is the default configuration for HA.
Survey results
The survey found resource utilization of around 90%, along with the following error in the logs:

error during container init: error setting cgroup config for procHooks process: unable to freeze: unknow
Also, there was no dedicated monitoring for ArgoCD.
What to do
First, we decided to obtain metrics for ArgoCD-related pods and create a dedicated dashboard in Datadog.
If you add the following annotation to your PodTemplateSpec, you can check the metrics on the ArgoCD dashboard provided by Datadog. For details, see this link.

Note that the metrics that can be collected differ from those in Prometheus; see the Datadog metrics list for details.
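As a rough sketch of what that annotation can look like (the container name, metrics port, and instance key follow Datadog's ArgoCD integration as we understand it; verify against the integration docs for your Agent version):

```yaml
# Hypothetical example: Datadog Autodiscovery annotation on the
# Application Controller's PodTemplateSpec.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/argocd-application-controller.checks: |
          {
            "argocd": {
              "instances": [
                {
                  "app_controller_endpoint": "http://%%host%%:8082/metrics"
                }
              ]
            }
          }
```

The Repo Server and API Server have analogous instance keys, each pointing at that component's metrics endpoint.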
The lack of scale of the ArgoCD Application Controller is quite noticeable. As we continued our investigation, we felt the need to reconsider the HA configuration of ArgoCD, so we conducted a detailed investigation based on several reference materials.
- These are guidelines for HA configuration published in the official ArgoCD documentation. They are particularly detailed and cover all possibilities, but the configuration may change depending on the version, so you should proceed while checking the version and release notes.
- Sync 10,000 Argo CD Applications in One Shot. By Jun Duan, Paolo Dettori, Andy Anderson: This document introduces a quantitative study on the scalability of ArgoCD. It includes benchmark data on the performance when syncing 10,000 applications, as well as experimental data on the number of RepoServers and reconcile intervals, making it a very useful indicator of the load on ArgoCD's application management.
- Argo CD Benchmarking Series. By Andrew Lee: The most comprehensive analysis of factors affecting ArgoCD performance. It helped us identify the bottleneck in this case.
Factors affecting ArgoCD performance
Key Components of ArgoCD
To understand Argo CD's performance, it's important to understand exactly what its key components do and how they work.
1. API Server (argocd-server)

Role:

- Acts as an authentication and authorization gateway and accepts all operation requests
- Handles requests from the CLI, Web UI, and webhooks

Operation:

- Receives API requests through the UI, CLI, or Git webhook events
- Authenticates users via JWT/SSO and applies RBAC policies
- Performs CRUD of Application and AppProject CRD resources according to the request
2. Repo Server (argocd-repo-server)

Role:

- Fetches source code from the Git repository and generates Kubernetes manifests
- Handles a variety of source formats, including Helm, Kustomize, and plain YAML

Operation:

- In response to a request from the Application Controller, retrieves source code from the specified Git repository and revision
- Generates Kubernetes manifests using Helm, Kustomize, etc.
- Caches the generated manifests on the local file system
- Stores some information (hash values of manifests) in memory
3. Application Controller (argocd-application-controller)

Role:

- Continuously monitors Application resources and reconciles the desired state in Git with the live state in the cluster
- Properly creates, updates, and deletes resources on the Kubernetes cluster

Operation:

- Watches Application resources and enqueues them for reconciliation
- Calls the Repo Server and gets the latest manifests (Desired State)
- Gets the current resource state through the Kubernetes API (Live State)
- Compares the desired state and the live state and calculates the difference
- Based on the diff, determines the required resource CRUD operations
Core Metrics
Application Controller
- Workqueue Work Duration Seconds (argocd.app_controller.workqueue.work.duration.seconds.bucket)
  This metric indicates the time it takes for the ArgoCD Application Controller to process an item in the WorkQueue. If the processing time is long, it may be a bottleneck, so it should be monitored.
- Workqueue Depth (argocd.app_controller.workqueue.depth)
  This metric indicates how many items are waiting in each WorkQueue. There are two main queues:
  - app_Reconcile_queue: the queue through which ArgoCD runs reconcile to keep manifests consistent between the Git repository and Redis. If changes to the repository occur frequently, processing this queue can take a long time.
  - app_operation_processing_queue: the queue through which ArgoCD keeps manifests consistent between Redis and your Kubernetes cluster, and syncs and deploys your applications.

  These two queues are processed by the configured number of processors. For details, see Element 3.
- Process CPU Seconds (argocd.app_controller.process.cpu.seconds.count)
  This metric indicates the CPU time consumed by the Application Controller. If multiple Application Controllers are used, take the average to monitor performance. Kubernetes Pod CPU metrics can also be used instead.
RepoServer
- Git Request Duration Seconds (argocd.repo_server.git.request.duration.seconds.bucket)
  This metric indicates the time it takes for ArgoCD's Repo Server to process a Git request. If access to the Git repository is slow, Sync delays may occur.
Element 1: Number of RepoServers
To improve the performance of ArgoCD, increasing the number of Repo Server replicas is effective: manifest generation is parallelized across replicas, which speeds up the sync process and reduces the sync time of the entire application.
The article Sync 10,000 Argo CD Applications in One Shot reports that by tripling the number of replicas, the overall Sync time was reduced by one-third.
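As a minimal sketch of this change (resource names follow the standard ArgoCD install; the replica count is illustrative and should be sized to your application count):

```yaml
# Illustrative: scale out the Repo Server to parallelize manifest generation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3   # example value; benchmark against your own workload
```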
Element 2: Reconcile interval

In the first round of Argo CD Benchmarking, a backlog in the app_Reconcile_queue that would normally take 30 minutes to drain was reduced to zero within 6 to 12 minutes by increasing the reconcile interval from 3 minutes to 6 minutes.
Element 3: Adjusting the Controller processor count

The Application Controller has two processor-count settings:

- controller.status.processors: the number of processors for monitoring and updating application state; these work the appRefreshQueue (app_Reconcile_queue)
- controller.operation.processors: the number of processors that execute operations on Kubernetes; these work the appOperationQueue (app_operation_processing_queue)

The defaults, controller.status.processors: 20 and controller.operation.processors: 10, are sized for about 400 applications. For 1,000 applications, controller.status.processors: 50 and controller.operation.processors: 25 are recommended.
In the second article of Argo CD Benchmarking, it was reported that doubling the number of processors reduced the Sync time by 33%. However, when increasing the number of processors, it is also important to balance it with the processing power of requests to the Kubernetes API server (Kubernetes Client QPS/Burst). If increasing the number of processors does not improve the performance, we recommend setting Kubernetes Client QPS/Burst.
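These settings live in the argocd-cmd-params-cm ConfigMap; a sketch matching the 1,000-application guideline above (the controller must be restarted to pick up the change):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"
  controller.operation.processors: "25"
```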
Element 4: Kubernetes API Server request related
The rate at which ArgoCD components send requests to the Kubernetes API server is governed by the ARGOCD_K8S_CLIENT_QPS and ARGOCD_K8S_CLIENT_BURST environment variables. The Argo CD Benchmarking series reports the following effects of raising them:

- Twice the default setting: 67% reduction in sync time
- 3x the default setting: 77% reduction in sync time

If increasing the number of processors does not improve performance, tuning ARGOCD_K8S_CLIENT_QPS/ARGOCD_K8S_CLIENT_BURST is the next lever.
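A sketch of raising these values on the Application Controller (the doubled values and default annotations are illustrative; the same variables apply to the other components):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_K8S_CLIENT_QPS    # default is 50
              value: "100"
            - name: ARGOCD_K8S_CLIENT_BURST  # default is 100
              value: "200"
```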
Element 5: Using Helm/Kustomize and MonoRepo
Using Helm or Kustomize to generate application manifests can have a performance impact, especially in monorepo environments: because multiple applications live in one repository, the generation process operates on the entire repository and becomes more complex.

The Repo Server clones the Git repository locally and generates manifests there. If manifest generation requires changes to repository files, only one such process is allowed in parallel per Repo Server replica, so a monorepo containing many applications becomes a bottleneck and performance decreases.
In particular, in a monorepo environment that contains 50 or more applications, this parallel processing limit often slows down processing when multiple manifest generation processes occur. If the use of Helm or Kustomize is required and a monorepo is used, the configuration must take this limit into account.
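One related knob, worth verifying against your ArgoCD version, is the Repo Server's manifest-generation parallelism limit in argocd-cmd-params-cm (the value here is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Limit on concurrent manifest-generation requests per Repo Server replica
  reposerver.parallelism.limit: "10"
```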
Element 6: Number of Application Controllers (number of shards)
ArgoCD's Application Controllers are deployed as StatefulSets and scale out using Sharding. By using multiple Application Controller Shards, you can balance the load and improve performance.
The ArgoCD HA documentation recommends sharding if your Application Controller is managing multiple clusters or consuming a lot of memory.
> If the controller is managing too many clusters and uses too much memory then you can shard clusters across multiple controller replicas.
The Application Controller currently supports three sharding algorithms:
Legacy

The default algorithm. It statically assigns each cluster to a shard based on a hash derived from the cluster Secret's uid, so the distribution takes neither load nor priority into account and can be uneven.

Round-robin
A simple algorithm that distributes applications evenly across all shards. This algorithm is also widely used in OS scheduling, and distributes applications without considering priority. ArgoCD assigns an order to all applications and distributes them across all shards in that order. However, if multiple clusters are managed, distribution is performed on a cluster-by-cluster basis. The round-robin method is available from 2.8.x onwards.
Consistent-hashing
Distributes applications using consistent hashing. In addition to even load balancing, it has the advantage of minimizing resource relocation when shards or clusters are added or removed. Consistent-hashing also does not consider priorities and distributes on a cluster-by-cluster basis. Consistent-hashing is available from 2.12.x onwards.
As we will show using real data later, the above three algorithms are not optimal in the following cases:
- When multiple clusters are managed and clusters have priority
- When managing multiple clusters and each cluster has different resource amounts
- When the resource usage of an application changes dynamically
In the second round of ArgoCD Benchmarking, we compared three algorithms based on CPU and memory fluctuations. As a result, Consistent-hashing showed the most stable performance.
Sharding is configured in three places:

- The ARGOCD_CONTROLLER_REPLICAS environment variable on the Application Controller
- The number of StatefulSet replicas
- The controller.sharding.algorithm setting (in argocd-cmd-params-cm)
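Putting the three together, a sketch of enabling three shards with the Round-robin algorithm (replica count is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: "round-robin"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"   # must match spec.replicas
```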
Trying out Sharding
All three sharding algorithms currently in use are designed to distribute resources equally. However, in a multi-cluster environment, they can only distribute resources on a cluster-by-cluster basis, so there are many reports of suboptimal performance when there are differences in the priorities or number of resources per cluster. Complaints about the algorithms have been raised for many years on GitHub issues.
In our initial implementation, we faced exactly this limitation of the sharding algorithm.
Comparison of sharding algorithms
Product A is composed of multiple clusters. Each cluster has a different priority and a different number of resources. The cluster list on the ArgoCD side is as follows:

- dev-eks
- stg-eks
- prd-eks
- sbx-eks
- shd-eks
- local
Each cluster is registered through a ClusterSecret, and the shd-eks cluster also hosts the Project/Application/ApplicationSet definitions themselves.

With the Legacy algorithm, the heavily loaded stg-eks cluster repeatedly ended up sharing a shard with other busy clusters. Because the number of shards was smaller than the number of clusters (shards < clusters), Round-Robin showed the same kind of skew.
Optimal Sharding Solution
Increasing the number of shards can reduce the risk of clusters with high loads being placed on the same shard, but it cannot completely solve the problem based on cluster priority.
ClusterSecret
クラスタ数 = Shard数
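Pinning a cluster to a shard looks roughly like this (the Secret name, server URL, and shard number are placeholders; the shard field in a cluster Secret statically assigns that cluster to a shard):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prd-eks
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prd-eks
  server: https://prd-eks.example.com   # placeholder URL
  shard: "2"                            # statically assign to shard 2
```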
In addition, the following disadvantage of sharding must be mentioned:

- If a shard is stopped, its tasks are not handed over to other shards.

Monitoring the Pods for each shard is therefore very important, whether the controller runs as a StatefulSet or as a Deployment.
Improvement results
Let's look at some key core metrics of these improvements:




Before the improvements

Before implementing the improvement measures, we analyzed the current situation while collecting metrics. CPU usage was consistently at 1.8 cores or higher, and although memory was amply allocated, usage was climbing toward 4 GB. The WorkQueue depth stayed above 100, peaked at 350, and approached that peak every two hours. We concluded that the WorkQueue congestion was caused by the long average WorkQueue processing time.
First step: enabling sharding
When Legacy Shard was enabled in v2.7, the initial switch required a Controller restart, which caused the WorkQueue depth to temporarily rise to 500, but then gradually decreased as the Controller replicas started up.
As a result, the WorkQueue processing time was reduced by more than 60%, but because multiple clusters were coexisting on the same shard, there was a large fluctuation in processing time. We confirmed that the processing time tended to increase when multiple clusters started processing at the same time, and to decrease when there was no simultaneous processing.
Second step: switching to Round-robin

Next, we switched the sharding algorithm to Round-robin. Around the same time we observed repository i/o timeout errors; we believe the network environment was also a factor, so we increased the number of Git request retries.
Sharding experiments and applying the optimal solution

We then experimented with Consistent-hashing shards before settling on the per-cluster assignment described above. Applications that had been stuck in an Unknown state recovered; even in an application with more than 2,000 resources, sync completed within one minute from start to finish, and all UI errors were resolved.
In terms of resource usage of each component, CPU usage was stable at an average of 900m and memory at 700 MB, and overall performance remained stable even when a large number of applications were reconciled at the same time.
Other tweaks
The following adjustments are not reflected in the graph analysis above because we did not set up a controlled experiment, but we list them here because they helped us eliminate some errors:
- Extending the Reconcile Period
Reconciling every 3 minutes was not required for Product A, so the reconcile period was extended to 5 minutes. To further reduce load when multiple applications reconcile at the same time, a jitter setting was introduced, so reconciles occur at random intervals between 5m and 5m + 1m.
- Dealing with Helm/Kustomize and Monorepos
In Product A, a specific repository manages the manifests of product applications, and another repository manages the manifests of applications managed by operators. Since Kustomize and Helm are used extensively in each repository, we extended the timeout of the Repo Server to allow Helm to run in parallel.
- Adjusting ARGOCD_K8S_CLIENT_QPS/BURST
We doubled these values for the Controllers, Repo Servers, and API Servers.
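The reconcile-interval and jitter tweaks above map to two argocd-cm keys; a sketch mirroring the 5m + 1m setup described (check your version, as jitter support is relatively recent):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "300s"        # reconcile every 5 minutes
  timeout.reconciliation.jitter: "60s"  # plus up to 1 minute of random jitter
```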
UI Performance
Comparing measurements from before the improvement efforts with those after the improvements and the manual sharding operations, we obtained the following results:
| | Before improvement (FCP, LCP) | After improvement (FCP, LCP) |
| --- | --- | --- |
| Application List | 712 ms, 1.8 s | 562 ms, 1.48 s |
| Application details (800+ resources) | 820 ms, 950 ms | 778 ms, 848 ms |
| Application details (2000+ resources) | 645 ms, 2.45 s | 600 ms, 1.12 s |
Overall, we saw improvements ranging from 11% in FCP to up to 27% in LCP, with some specific pages seeing LCP improvements of over 50%.
We were only able to measure the before-improvement data once, between 10:00 a.m. and noon, a time period when multiple developers use the system at the same time, so we cannot provide precise figures, but the performance improvement was clear.
However, in some applications with 2,000+ resources, delays in screen rendering continue to occur, and the issue of the entire screen freezing, especially when opening resource details with a large amount of information, has not yet been resolved. The load appears to center on the api/v1/stream/applications/<app_name>/resource-tree endpoint, which streams the application's entire resource tree.
Many performance issues with the ArgoCD UI have also been reported in GitHub Issues, particularly degradation when the number of resources is large.
The most promising solution at the moment is server-side paging, which is on the roadmap but not yet implemented.
Unmanaged Resource Problems
There are two main methods for resource tracking in ArgoCD: Annotation and Label.

With the default Label method, a tracking label is applied to every resource ArgoCD manages, and the same label is propagated to the child resources those resources generate, so the children are also treated as managed resources of the ArgoCD Application. Product A's setup, which uses KubeVela, was presented at CNDT 2021; this article is a follow-up to that story.
For example, in Product A, there is an ArgoCD application that manages more than 70 KubeVela applications, and the total number of resources generated by KubeVela reaches more than 2500. In this case, there was a significant delay in the ArgoCD UI rendering speed and the Controller reconcile process, resulting in a noticeable degradation in performance.
This resource management issue has not yet been resolved in Product A and continues to be a challenge.
Conclusion
Thank you for reading this far.
ArgoCD is easy to operate at the beginning, but as the number of applications increases, the number of things to be careful about also increases. Fortunately, thanks to the active participation of the ArgoCD community, many performance-related parameters have been added in recent years. If there is an opportunity to contribute to ArgoCD in the future, I would like to try it.
SRG is looking for people to work with us. If you are interested, please contact us here.