Optimizing ArgoCD performance in HA configurations
SRG (Service Reliability Group) is a team that mainly provides cross-cutting support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
This article introduces methods to solve the performance issues of ArgoCD in HA configuration. Specifically, it explains optimization techniques to deal with deployment delays caused by increasing resources and how to implement load balancing using sharding.
I hope this helps in some way.
- ArgoCD is slow
- One day, you notice a delay in deployment
- Current situation survey
- Survey results
- What should I do?
- Factors affecting ArgoCD performance
- Key Components of ArgoCD
- Core Metrics
- Element 1: Number of RepoServers
- Element 2: Reconcile interval
- Element 3: Adjusting the number of Controller processors
- Element 4: Kubernetes API server request settings
- Element 5: Using Helm/Kustomize and MonoRepo
- Element 6: Number of Application Controllers (Number of Shards)
- Try Sharding
- Comparison of Sharding Algorithms
- Optimal Sharding Solution
- Improvement results
- Before the improvements
- Shard activation: step 1
- Shard activation: step 2
- Sharding experiments and applying the optimal solution
- Other adjustments
- UI Performance
- Uncontrolled Resources Issues
- Conclusion
ArgoCD is slow
It's already been four years since we migrated the continuous delivery (CD) function of our A product to ArgoCD. During this period, the number of services we migrated from legacy systems has continued to increase, and the resources managed in our CD environment have rapidly expanded. While the migration initially went smoothly, new challenges have arisen over the past few years, highlighting the need for optimization.
One day, you notice a delay in deployment.
One day, a backend developer reported that deployment was taking more than 30 minutes. I immediately checked the ArgoCD UI and was shocked to find it littered with "Context deadline exceeded" errors.
Current situation survey
Project/ApplicationSet
The application source repo is divided into three parts depending on the purpose.
- ArgoCD Application Definition (Helm Chart)
- Application Manifest (Kustomize)
- Cluster Component CRD Manifest (Kustomize + Helm Chart)
There are approximately 250 applications, and approximately 30,000 resources managed and tracked via ArgoCD.
The ArgoCD configuration is the default configuration for HA.
Survey results
- 90%
- error during container init: error setting cgroup config for procHooks process: unable to freeze: unknow
Also, there was no dedicated monitoring for ArgoCD.
What should I do?
First, we decided to obtain metrics for ArgoCD-related pods and create a dedicated dashboard in Datadog.
If you add the following annotation to your PodTemplateSpec, you can view metrics on the ArgoCD dashboard provided by Datadog. For details, see this link.
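As a reference, the annotation in question is a Datadog Autodiscovery check on the Application Controller's PodTemplateSpec. The sketch below is illustrative only; the option names (app_controller_endpoint) and the metrics port (8082) follow the Datadog ArgoCD integration documentation and should be verified for your Agent and ArgoCD versions.

```yaml
# Illustrative sketch: Autodiscovery annotation on the application controller Pod.
# Verify the check option names and metrics port against the Datadog ArgoCD
# integration docs for your versions.
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/argocd-application-controller.checks: |
          {
            "argocd": {
              "init_config": {},
              "instances": [
                {"app_controller_endpoint": "http://%%host%%:8082/metrics"}
              ]
            }
          }
```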
The metrics that can be collected differ from those available in Prometheus, so please see this list for details.
The lack of scale of the ArgoCD Application Controller was quite noticeable. As we continued our investigation, we felt the need to reconsider the HA configuration of ArgoCD, so we conducted a detailed investigation based on several reference materials.
- These are guidelines for HA configuration published in the official ArgoCD documentation. They are particularly detailed and cover all possibilities, but the configuration may change depending on the version, so you should proceed while checking the version and release notes.
- Sync 10,000 Argo CD Applications in One Shot. By Jun Duan, Paolo Dettori, Andy Anderson: This paper presents a quantitative study on the scalability of ArgoCD. It includes performance benchmarks for syncing 10,000 applications, as well as experimental data on the number of RepoServers and reconcile intervals. This is a very useful indicator of the load on ArgoCD application management.
- Argo CD Benchmarking Series. By Andrew Lee: This is the most comprehensive analysis of factors affecting ArgoCD performance. It helped us identify the bottleneck in our case.
Factors affecting ArgoCD performance
Key Components of ArgoCD
To understand Argo CD's performance, it's important to understand exactly what its key components do and how they work.
1. API Server
Role:
- Acts as an authentication and authorization gateway and accepts all operation requests
- Handles requests from the CLI, Web UI, and webhooks
- Manages Application and AppProject CRD resources
Operation:
- Receives API requests via the UI, CLI, or Git webhook events
- Authenticates users with JWT/SSO and applies RBAC policies
- Performs CRUD operations on Application CRD resources according to the request content
2. Repo Server
Role:
- Retrieves source code from Git repositories and generates Kubernetes manifests
- Handles various source formats, including Helm, Kustomize, and plain YAML
Operation:
- In response to a request from the Application Controller, retrieves source code from the specified Git repository and revision.
- Generate Kubernetes manifests using Helm, Kustomize, etc.
- Cache the generated manifest to the local file system
- Store some information (hash value of manifest) in memory
3. Application Controller
Role:
- Continuously monitors Application resources and keeps the live state on the cluster in line with the desired state defined in Git
- Properly creates, updates, and deletes resources on the Kubernetes cluster
Operation:
- Watches Application resources and adds them to the reconcile queue
- Call the Repo Server and get the latest manifest (Desired State)
- Get the current resource state through the Kubernetes API (Live State)
- Compare the desired state with the live state and calculate the difference
- Based on the diff, determine the required resource CRUD operations
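For context, here is a minimal Application resource of the kind these components operate on. All names and URLs are illustrative, not Product A's actual definitions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sample-app          # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests.git  # illustrative repository
    targetRevision: main
    path: apps/sample-app
  destination:
    server: https://kubernetes.default.svc
    namespace: sample-app
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert manual drift on the cluster
```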
Core Metrics
Application Controller
- Workqueue Work Duration Seconds
argocd.app_controller.workqueue.work.duration.seconds.bucket
This metric indicates the time it takes for ArgoCD's Application Controller to process items in the WorkQueue. If the processing time is long, it may be a sign of a bottleneck, so it should be monitored.
- Workqueue Depth
argocd.app_controller.workqueue.depth
This metric shows how many items are waiting in each WorkQueue. The Application Controller has two main queues:
- app_reconciliation_queue: the queue in which ArgoCD runs reconcile to keep manifests consistent between the Git repository and Redis, synchronizing the state of the Git repository with the cache. If changes to the repository occur frequently, processing this queue can take some time.
- app_operation_processing_queue: the queue that ArgoCD uses to keep manifests consistent between Redis and the Kubernetes cluster, and to sync and deploy applications.
The two queues are each processed by their own set of processors. For details, see Element 3.
- Process CPU Seconds
argocd.app_controller.process.cpu.seconds.count
This metric indicates the CPU time consumed by the Application Controller. If multiple Application Controllers are used, the average value is used to monitor performance. Kubernetes Pod CPU metrics can also be used instead.
RepoServer
- Git Request Duration Seconds
argocd.repo_server.git.request.duration.seconds.bucket
This metric shows the time it takes for ArgoCD's Repo Server to process a Git request. If it takes a long time to access the Git repository, sync delays may occur.
Element 1: Number of RepoServers
To improve ArgoCD's performance, increasing the number of Repo Server replicas is effective: manifest generation is parallelized across the replicas, which shortens the overall sync time across applications.
The article Sync 10,000 Argo CD Applications in One Shot reports that tripling the number of replicas reduced the overall Sync time by one-third.
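As a minimal sketch, assuming the standard HA manifests where the Repo Server runs as a Deployment named argocd-repo-server, the replica count can be raised with a Kustomize strategic-merge patch like this:

```yaml
# Strategic-merge patch: scale out the Repo Server so manifest generation
# is parallelized across replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3
```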
Element 2: Reconcile interval
The reconcile interval determines how often applications are refreshed, i.e. how often they are added to app_reconciliation_queue. In the first round of Argo CD Benchmarking, increasing the interval from 3 minutes to 6 minutes helped a Sync workload that would normally take 30 minutes to complete: the number of OutOfSync applications dropped to 0 within 6 to 12 minutes.
Element 3: Adjusting the number of Controller processors
The Application Controller exposes two processor settings:
- controller.status.processors: the number of processors that monitor and update application status; they work off appRefreshQueue (app_reconciliation_queue)
- controller.operation.processors: the number of processors that perform operations against Kubernetes; they work off appOperationQueue (app_operation_processing_queue)
The default values correspond to roughly 400 applications: controller.status.processors is set to 20 and controller.operation.processors to 10. For 1,000 applications, it is recommended to set controller.status.processors to 50 and controller.operation.processors to 25.
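A minimal sketch of those recommended values, set through argocd-cmd-params-cm (the Application Controller needs to be restarted to pick them up):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  controller.status.processors: "50"     # processes app_reconciliation_queue
  controller.operation.processors: "25"  # processes app_operation_processing_queue
```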
The second article in the Argo CD Benchmarking series reported that doubling the number of processors reduced Sync times by 33%. However, when increasing the number of processors, it is also important to balance this with the processing power for requests to the Kubernetes API server (Kubernetes Client QPS/Burst). If increasing the number of processors does not improve performance, we recommend configuring Kubernetes Client QPS/Burst.
Element 4: Kubernetes API server request settings
The Application Controller's requests to the Kubernetes API server are rate-limited on the client side by ARGOCD_K8S_CLIENT_QPS and ARGOCD_K8S_CLIENT_BURST. The Argo CD Benchmarking series reports the following effects of raising these values:
- Twice the default setting: 67% reduction in Sync time
- Three times the default setting: 77% reduction in Sync time
Keep in mind that raising ARGOCD_K8S_CLIENT_QPS/ARGOCD_K8S_CLIENT_BURST also increases the load on the Kubernetes API server, so adjust them in line with the API server's capacity.
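A sketch of raising these limits on the Application Controller via environment variables, again as a Kustomize strategic-merge patch; the default values depend on the ArgoCD version, so treat the numbers below as illustrative:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_K8S_CLIENT_QPS    # client-side QPS limit toward the API server
              value: "100"
            - name: ARGOCD_K8S_CLIENT_BURST  # allowed burst above the QPS limit
              value: "200"
```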
Element 5: Using Helm/Kustomize and MonoRepo
Generating application manifests with Helm or Kustomize can have a performance impact, especially in monorepo environments, where a single repository contains many applications and manifest generation involves working across the whole repository.
The Repo Server clones the Git Repo locally and generates a manifest. If the manifest generation requires changes to repository files, only one parallel process is allowed per Repo Server replica. If there are many applications in the monorepo, this can become a bottleneck and reduce performance.
In particular, in a monorepo environment containing 50 or more applications, this concurrency limit often slows down processing when multiple manifest generation processes occur. If you must use Helm or Kustomize and are using a monorepo, you must configure it with this limit in mind.
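One related knob is the Repo Server's parallelism limit, which caps concurrent manifest generations per replica; here is a sketch via argocd-cmd-params-cm, with an illustrative value. In monorepos it can also help to set the argocd.argoproj.io/manifest-generate-paths annotation on each Application so that webhook-triggered refreshes only regenerate manifests when files under that application's path change.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  reposerver.parallelism.limit: "10"  # max concurrent manifest generations per Repo Server replica
```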
Element 6: Number of Application Controllers (Number of Shards)
ArgoCD's Application Controllers are deployed as StatefulSets and scale out using sharding. Using multiple Application Controller shards allows for load balancing and improved performance.
The ArgoCD HA documentation recommended sharding if the Application Controller was managing multiple clusters or consuming a lot of memory.
If the controller is managing too many clusters and uses too much memory then you can shard clusters across multiple controller replicas.
The Application Controller currently supports three sharding algorithms:
Legacy
This is the default algorithm. It assigns each cluster to a shard based on a hash of the cluster secret's UID, so the resulting distribution is not necessarily even.
Round-robin
This is a simple algorithm that distributes applications evenly across all shards. This algorithm is widely used in OS scheduling and performs distribution without considering priority. ArgoCD assigns an order to all applications and distributes them across all shards in that order. However, if multiple clusters are managed, distribution is performed on a cluster-by-cluster basis. The round-robin method is available from 2.8.x onwards.
Consistent-hashing
Applications are distributed using consistent hashing. In addition to even load distribution, it has the advantage of minimizing resource reallocation when shards or clusters are added or deleted. Consistent-hashing also does not consider priority and distributes on a cluster-by-cluster basis. Consistent-hashing is available from 2.12.x onwards.
As we will show later using real data, the above three algorithms are not optimal in the following cases:
- When multiple clusters are managed and clusters have priority
- When managing multiple clusters and each cluster has different resource amounts
- When the resource usage of an application changes dynamically
The second Argo CD Benchmarking article compared the three algorithms in terms of CPU and memory variability, and consistent-hashing showed the most stable performance.
Sharding is configured in the following three places:
- The ARGOCD_CONTROLLER_REPLICAS environment variable on the Application Controller
- The number of replicas of the Application Controller StatefulSet
- The controller.sharding.algorithm setting
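A sketch of those three settings together, assuming the HA manifests (strategic-merge patch style; the algorithm value and replica count are illustrative):

```yaml
# 1) sharding algorithm
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: round-robin  # legacy | round-robin | consistent-hashing
---
# 2) StatefulSet replicas and 3) ARGOCD_CONTROLLER_REPLICAS, which must match
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
```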
Try Sharding
All three sharding algorithms currently in use are designed to distribute resources evenly. However, in multi-cluster environments, they can only distribute resources at the cluster level. This means that performance is often suboptimal when there are differences in the priorities and number of resources between clusters. This issue has been a source of complaints about the algorithms for many years.
In our initial implementation, we faced exactly this limitation of the sharding algorithm.
Comparison of Sharding Algorithms
Product A consists of multiple clusters, each with a different priority and a different number of resources. The clusters registered with ArgoCD are as follows:
- dev-eks
- stg-eks
- prd-eks
- sbx-eks
- shd-eks
- local
Each cluster is registered as a ClusterSecret. The shd-eks cluster hosts the Project/Application/ApplicationSet definitions themselves, so its load profile differs considerably from the other clusters.
With the Legacy algorithm, shard assignment is derived from the ClusterSecret, so heavily loaded clusters such as shd-eks and stg-eks can land on the same shard; whenever the number of shards is smaller than the number of clusters, such collisions are unavoidable. Round-Robin distributes clusters evenly across shards, but because it ignores each cluster's priority and resource count, heavy clusters can still end up sharing a shard.
Optimal Sharding Solution
Increasing the number of shards can reduce the risk of clusters with high loads being placed on the same shard, but it cannot completely solve the problem of cluster priority.
Our answer was to make the number of shards equal to the number of clusters and to explicitly pin each cluster to its own shard via the shard field of its ClusterSecret.
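A sketch of pinning a cluster this way, using the shard field of the cluster Secret; the cluster name comes from the list above, while the server URL and auth config are illustrative:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prd-eks
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prd-eks
  server: https://prd-eks.example.com  # illustrative endpoint
  shard: "2"                           # pin this cluster to shard 2
  config: |
    {"awsAuthConfig": {"clusterName": "prd-eks"}}
```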
In addition, the following disadvantages of sharding must be mentioned:
- If a shard dies, its tasks will not be taken over by other shards.
This makes monitoring the Pods on each shard extremely important.
A newer alternative is to run the Application Controller as a Deployment with the dynamic cluster distribution feature, which can reassign clusters when a shard goes down.
Improvement results
Here are some key core metrics that explain the improvements:
Before the improvements
Before implementing any improvements, we analyzed the current situation by collecting metrics information. CPU usage remained above 1.8, and although sufficient memory was allocated, it continued to increase to 4GB. The WorkQueue depth remained above 100, reaching 350 at its peak and approaching that peak every two hours. We determined that the cause of the WorkQueue congestion was the long average WorkQueue processing time.
Shard activation: step 1
When Legacy Shard was enabled in v2.7, the initial switch required a Controller restart, which caused the WorkQueue depth to temporarily rise to 500, but this gradually decreased as Controller replicas started up.
As a result, the WorkQueue processing time was reduced by more than 60%, but because multiple clusters coexisted on the same shard, there was a large fluctuation in processing time. We confirmed that processing time tended to increase when multiple clusters started processing simultaneously, and to decrease when there was no simultaneous processing.
Shard activation: step 2
In the second step, we switched the sharding algorithm to Round-robin. During this period we also saw intermittent repository i/o timeout errors on Git operations; we believe the network environment is a contributing factor, so we increased the number of retries as a countermeasure.
Sharding experiments and applying the optimal solution
We then experimented with the Consistent-hashing algorithm and finally applied the optimal solution described above, pinning each cluster to its own shard. Applications that had been stuck in an Unknown state recovered.
Even for applications with more than 2,000 resources, sync was completed within one minute from start to finish, and all UI errors were resolved.
In terms of resource usage for each component, CPU usage stabilized at an average of around 900m (millicores) and memory at around 700MB, and overall performance remained stable even when many applications were reconciled simultaneously.
Other adjustments
The following adjustments are not reflected in the graph analysis above because we did not set up a controlled experiment, but we list them here because they helped eliminate some errors.
- Extending the Reconcile Period
Reconciling every 3 minutes was not required for Product A, so the reconcile period was extended to 5 minutes. To reduce the load when many applications reconcile at the same time, a jitter setting was also introduced, so reconciles now occur at random intervals between 5m and 6m (5m plus up to 1m of jitter). A configuration sketch follows this list.
- Dealing with Helm/Kustomize and Monorepos
Product A manages the manifests of product applications in a specific repository, and manages the manifests of applications managed by operators in a separate repository. Because they make heavy use of Kustomize and Helm, respectively, we extended the timeout for the Repo Server to allow Helm to run in parallel.
- Adjusting K8S_CLIENT_QPS/BURST
The values for the Controller, Repo Server, and API Server were each doubled.
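Here is the sketch of the reconcile-interval change mentioned in the first item of this list, via argocd-cm (timeout.reconciliation.jitter requires a relatively recent ArgoCD version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  timeout.reconciliation: 300s        # refresh every 5 minutes instead of the default 3m
  timeout.reconciliation.jitter: 60s  # add up to 1 minute of random jitter
```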
UI Performance
Comparing the situation before and after manual sharding was implemented, we were able to obtain the following results:
| Page | Before improvement (FCP, LCP) | After improvement (FCP, LCP) |
| --- | --- | --- |
| Application List | 712ms, 1.8s | 562ms, 1.48s |
| Application details (800+ resources) | 820ms, 950ms | 778ms, 848ms |
| Application details (2000+ resources) | 645ms, 2.45s | 600ms, 1.12s |
Overall, we saw improvements ranging from 11% in FCP to up to 27% in LCP, with some pages seeing LCP improvements of over 50%.
We were only able to measure the data before the improvements once between 10:00 AM and 12:00 PM, a time when multiple developers use the system at the same time, so we are unable to provide accurate figures, but we could tell from subjective experience that performance had improved significantly.
However, in some applications with over 2000 resources, delays in screen rendering continue to occur, and the issue of the entire screen freezing, especially when opening resource details with a large amount of information, has not yet been resolved.
In particular, the response from api/v1/stream/applications/<app_name>/resource-tree, which the UI uses to render the resource tree, becomes very large for such applications.
Many performance issues with the ArgoCD UI have also been reported on Github issues, with performance degradation particularly noticeable when the number of resources is large.
The most promising solution at the moment is server-side paging, which is on the roadmap but not yet addressed.
Uncontrolled Resources Issues
There are two main methods for Resource Tracking in ArgoCD: Annotation and Label.
In the default Label method, ArgoCD attaches a tracking label to each resource it manages, and child resources generated from those resources carry the same label, so they too are treated as resources managed by the ArgoCD Application. Product A was featured at CNDT 2021, where we talked about using KubeVela; this article is a follow-up to that story.
For example, Product A had an ArgoCD application that managed over 70 KubeVela applications, and the total number of resources generated by KubeVela reached over 2,500. In this case, there were significant delays in ArgoCD UI rendering speed and Controller reconcile processing, resulting in noticeable performance degradation.
This resource management issue has not yet been resolved in Product A and remains a challenge.
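For reference, the tracking method itself can be switched in argocd-cm. Whether annotation-based tracking actually helps with KubeVela-generated child resources is something we have not validated, so treat this purely as a sketch:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  application.resourceTrackingMethod: annotation  # default is "label"
```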
Conclusion
Thank you for reading this far.
Operating ArgoCD is easy in the early stages, but as the number of applications grows, so does the number of things you need to watch out for. Fortunately, thanks to the active ArgoCD community, many performance-related parameters have been added in recent years. If there is an opportunity to contribute to ArgoCD in the future, I would definitely like to take it.
SRG is looking for people to work with us.
If you're interested, please contact us here.