The path to optimizing ArgoCD performance in an HA configuration
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and more.
This article introduces methods to solve the performance issues of ArgoCD in HA configuration. Specifically, it explains optimization techniques to deal with deployment delays caused by resource increases and how to implement load balancing using sharding.
I hope this helps in some way.
- ArgoCD is slow
- One day, you notice a delay in deployment
- Current situation survey
- Survey results
- What to do
- Factors affecting ArgoCD performance
- Key Components of ArgoCD
- Core Metrics
- Element 1: Number of RepoServers
- Element 2: Reconcile interval
- Element 3: Adjusting the Controller processor count
- Element 4: Kubernetes API Server request related
- Element 5: Using Helm/Kustomize and MonoRepo
- Element 6: Number of Application Controllers (number of shards)
- Trying out Sharding
- Comparison of sharding algorithms
- Optimal Sharding Solution
- Improvement results
- Before the improvements
- First step: enabling sharding
- Second step: switching to Round-robin
- Sharding experiments and applying the optimal solution
- Other tweaks
- UI Performance
- Unmanaged Resource Problems
- Conclusion
ArgoCD is slow
It has already been four years since we migrated the continuous delivery (CD) function of Product A to ArgoCD. During this period, the number of services migrated from legacy systems has continued to grow, and the resources managed in our CD environment have expanded rapidly. Although deployments were smooth at the beginning of the migration, challenges that have arisen over the past few years have made the need for optimization clear.
One day, you notice a delay in deployment.
One day, a backend developer reported that deployment was taking more than 30 minutes. When I checked the ArgoCD UI, I was shocked to find it was true, and the UI was showing the following error:

Context deadline exceeded
Current situation survey
Project/ApplicationSet
The Application source repositories are divided into three parts, depending on their purpose:
- ArgoCD Application Definition (Helm Chart)
- Application Manifest (Kustomize)
- Cluster Component CRD Manifest (Kustomize + Helm Chart)
There are approximately 250 applications and approximately 30,000 resources managed and tracked via ArgoCD.
The ArgoCD configuration is the default configuration for HA.
Survey results
The survey found resource utilization of around 90%, along with the following error in the logs:

error during container init: error setting cgroup config for procHooks process: unable to freeze: unknow
Also, there was no dedicated monitoring for ArgoCD.
What to do
First, we decided to obtain metrics for ArgoCD-related pods and create a dedicated dashboard in Datadog.
If you add the following annotation to your PodTemplateSpec, you can check the metrics on the ArgoCD dashboard provided by Datadog. For details, see this link.

Note that the metrics that can be collected differ from those in Prometheus; see the Datadog metrics list for details.
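As a rough sketch of what that annotation can look like (the container name, metrics port, and instance key follow Datadog's ArgoCD integration as we understand it; verify against the integration docs for your Agent version):

```yaml
# Hypothetical example: Datadog Autodiscovery annotation on the
# Application Controller's PodTemplateSpec.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/argocd-application-controller.checks: |
          {
            "argocd": {
              "instances": [
                {
                  "app_controller_endpoint": "http://%%host%%:8082/metrics"
                }
              ]
            }
          }
```

The Repo Server and API Server have analogous instance keys, each pointing at that component's metrics endpoint.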
The lack of scale of the ArgoCD Application Controller is quite noticeable. As we continued our investigation, we felt the need to reconsider the HA configuration of ArgoCD, so we conducted a detailed investigation based on several reference materials.
- These are guidelines for HA configuration published in the official ArgoCD documentation. They are particularly detailed and cover all possibilities, but the configuration may change depending on the version, so you should proceed while checking the version and release notes.
- Sync 10,000 Argo CD Applications in One Shot. By Jun Duan, Paolo Dettori, Andy Anderson: This document introduces a quantitative study on the scalability of ArgoCD. It includes benchmark data on the performance when syncing 10,000 applications, as well as experimental data on the number of RepoServers and reconcile intervals, making it a very useful indicator of the load on ArgoCD's application management.
- Argo CD Benchmarking Series. By Andrew Lee: The most comprehensive analysis of factors affecting ArgoCD performance. It helped us identify the bottleneck in this case.
Factors affecting ArgoCD performance
Key Components of ArgoCD
To understand Argo CD's performance, it's important to understand exactly what its key components do and how they work.
1. API Server (argocd-server)

Role:

- Acts as an authentication and authorization gateway and accepts all operation requests
- Handles requests from the CLI, Web UI, and webhooks

Operation:

- Receives API requests through the UI, CLI, or Git webhook events
- Authenticates users via JWT/SSO and applies RBAC policies
- Performs CRUD of Application and AppProject CRD resources according to the request
2. Repo Server (argocd-repo-server)

Role:

- Fetches source code from the Git repository and generates Kubernetes manifests
- Handles a variety of source formats, including Helm, Kustomize, and plain YAML

Operation:

- In response to a request from the Application Controller, retrieves source code from the specified Git repository and revision
- Generates Kubernetes manifests using Helm, Kustomize, etc.
- Caches the generated manifests on the local file system
- Stores some information (hash values of manifests) in memory
3. Application Controller (argocd-application-controller)

Role:

- Continuously monitors Application resources and reconciles the desired state in Git with the live state in the cluster
- Properly creates, updates, and deletes resources on the Kubernetes cluster

Operation:

- Watches Application resources and enqueues them for reconciliation
- Calls the Repo Server and gets the latest manifests (Desired State)
- Gets the current resource state through the Kubernetes API (Live State)
- Compares the desired state and the live state and calculates the difference
- Based on the diff, determines the required resource CRUD operations
Core Metrics
Application Controller
- Workqueue Work Duration Seconds (argocd.app_controller.workqueue.work.duration.seconds.bucket)
  This metric indicates the time it takes for the ArgoCD Application Controller to process an item in the WorkQueue. If the processing time is long, it may be a bottleneck, so it should be monitored.
- Workqueue Depth (argocd.app_controller.workqueue.depth)
  This metric indicates how many items are waiting in each WorkQueue. There are two main queues:
  - app_Reconcile_queue: the queue through which ArgoCD runs reconcile to keep manifests consistent between the Git repository and Redis. If changes to the repository occur frequently, processing this queue can take a long time.
  - app_operation_processing_queue: the queue through which ArgoCD keeps manifests consistent between Redis and your Kubernetes cluster, and syncs and deploys your applications.

  These two queues are processed by the configured number of processors. For details, see Element 3.
- Process CPU Seconds (argocd.app_controller.process.cpu.seconds.count)
  This metric indicates the CPU time consumed by the Application Controller. If multiple Application Controllers are used, take the average to monitor performance. Kubernetes Pod CPU metrics can also be used instead.
RepoServer
- Git Request Duration Seconds (argocd.repo_server.git.request.duration.seconds.bucket)
  This metric indicates the time it takes for ArgoCD's Repo Server to process a Git request. If access to the Git repository is slow, Sync delays may occur.
Element 1: Number of RepoServers
To improve the performance of ArgoCD, increasing the number of Repo Server replicas is effective: manifest generation is parallelized across replicas, which speeds up the sync process and reduces the sync time of the entire application.
The article Sync 10,000 Argo CD Applications in One Shot reports that by tripling the number of replicas, the overall Sync time was reduced by one-third.
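As a minimal sketch of this change (resource names follow the standard ArgoCD install; the replica count is illustrative and should be sized to your application count):

```yaml
# Illustrative: scale out the Repo Server to parallelize manifest generation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3   # example value; benchmark against your own workload
```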
Element 2: Reconcile interval

In the first round of Argo CD Benchmarking, a backlog in the app_Reconcile_queue that would normally take 30 minutes to drain was reduced to zero within 6 to 12 minutes by increasing the reconcile interval from 3 minutes to 6 minutes.
Element 3: Adjusting the Controller processor count

The Application Controller has two processor-count settings:

- controller.status.processors: the number of processors for monitoring and updating application state; these work the appRefreshQueue (app_Reconcile_queue)
- controller.operation.processors: the number of processors that execute operations on Kubernetes; these work the appOperationQueue (app_operation_processing_queue)

The defaults, controller.status.processors: 20 and controller.operation.processors: 10, are sized for about 400 applications. For 1,000 applications, controller.status.processors: 50 and controller.operation.processors: 25 are recommended.
In the second article of Argo CD Benchmarking, it was reported that doubling the number of processors reduced the Sync time by 33%. However, when increasing the number of processors, it is also important to balance it with the processing power of requests to the Kubernetes API server (Kubernetes Client QPS/Burst). If increasing the number of processors does not improve the performance, we recommend setting Kubernetes Client QPS/Burst.
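These settings live in the argocd-cmd-params-cm ConfigMap; a sketch matching the 1,000-application guideline above (the controller must be restarted to pick up the change):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"
  controller.operation.processors: "25"
```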
Element 4: Kubernetes API Server request related
The rate at which ArgoCD components send requests to the Kubernetes API server is governed by the ARGOCD_K8S_CLIENT_QPS and ARGOCD_K8S_CLIENT_BURST environment variables. The Argo CD Benchmarking series reports the following effects of raising them:

- Twice the default setting: 67% reduction in sync time
- 3x the default setting: 77% reduction in sync time

If increasing the number of processors does not improve performance, tuning ARGOCD_K8S_CLIENT_QPS/ARGOCD_K8S_CLIENT_BURST is the next lever.
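A sketch of raising these values on the Application Controller (the doubled values and default annotations are illustrative; the same variables apply to the other components):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_K8S_CLIENT_QPS    # default is 50
              value: "100"
            - name: ARGOCD_K8S_CLIENT_BURST  # default is 100
              value: "200"
```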
Element 5: Using Helm/Kustomize and MonoRepo
Using Helm or Kustomize to generate application manifests can have a performance impact, especially in monorepo environments: because multiple applications live in one repository, the generation process operates on the entire repository and becomes more complex.

The Repo Server clones the Git repository locally and generates manifests there. If manifest generation requires changes to repository files, only one such process is allowed in parallel per Repo Server replica, so a monorepo containing many applications becomes a bottleneck and performance decreases.
In particular, in a monorepo environment that contains 50 or more applications, this parallel processing limit often slows down processing when multiple manifest generation processes occur. If the use of Helm or Kustomize is required and a monorepo is used, the configuration must take this limit into account.
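One related knob, worth verifying against your ArgoCD version, is the Repo Server's manifest-generation parallelism limit in argocd-cmd-params-cm (the value here is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Limit on concurrent manifest-generation requests per Repo Server replica
  reposerver.parallelism.limit: "10"
```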
Element 6: Number of Application Controllers (number of shards)
ArgoCD's Application Controllers are deployed as StatefulSets and scale out using Sharding. By using multiple Application Controller Shards, you can balance the load and improve performance.
The ArgoCD HA documentation recommends sharding if your Application Controller is managing multiple clusters or consuming a lot of memory.
> If the controller is managing too many clusters and uses too much memory then you can shard clusters across multiple controller replicas.
The Application Controller currently supports three sharding algorithms:
Legacy

The default algorithm. It statically assigns each cluster to a shard based on a hash derived from the cluster Secret's uid, so the distribution takes neither load nor priority into account and can be uneven.

Round-robin
A simple algorithm that distributes applications evenly across all shards. This algorithm is also widely used in OS scheduling, and distributes applications without considering priority. ArgoCD assigns an order to all applications and distributes them across all shards in that order. However, if multiple clusters are managed, distribution is performed on a cluster-by-cluster basis. The round-robin method is available from 2.8.x onwards.
Consistent-hashing
Distributes applications using consistent hashing. In addition to even load balancing, it has the advantage of minimizing resource relocation when shards or clusters are added or removed. Consistent-hashing also does not consider priorities and distributes on a cluster-by-cluster basis. Consistent-hashing is available from 2.12.x onwards.
As we will show using real data later, the above three algorithms are not optimal in the following cases:
- When multiple clusters are managed and clusters have priority
- When managing multiple clusters and each cluster has different resource amounts
- When the resource usage of an application changes dynamically
In the second round of ArgoCD Benchmarking, we compared three algorithms based on CPU and memory fluctuations. As a result, Consistent-hashing showed the most stable performance.
Sharding is configured in three places:

- The ARGOCD_CONTROLLER_REPLICAS environment variable on the Application Controller
- The number of StatefulSet replicas
- The controller.sharding.algorithm setting (in argocd-cmd-params-cm)
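Putting the three together, a sketch of enabling three shards with the Round-robin algorithm (replica count is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: "round-robin"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"   # must match spec.replicas
```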
Trying out Sharding
All three sharding algorithms currently in use are designed to distribute resources equally. However, in a multi-cluster environment, they can only distribute resources on a cluster-by-cluster basis, so there are many reports of suboptimal performance when there are differences in the priorities or number of resources per cluster. Complaints about the algorithms have been raised for many years on GitHub issues.
In our initial implementation, we faced exactly this limitation of the sharding algorithm.
Comparison of sharding algorithms
Product A is composed of multiple clusters. Each cluster has a different priority and a different number of resources. The cluster list on the ArgoCD side is as follows:

- dev-eks
- stg-eks
- prd-eks
- sbx-eks
- shd-eks
- local
Each cluster is registered through a ClusterSecret, and the shd-eks cluster also hosts the Project/Application/ApplicationSet definitions themselves.

With the Legacy algorithm, the heavily loaded stg-eks cluster repeatedly ended up sharing a shard with other busy clusters. Because the number of shards was smaller than the number of clusters (shards < clusters), Round-Robin showed the same kind of skew.
Optimal Sharding Solution
Increasing the number of shards can reduce the risk of clusters with high loads being placed on the same shard, but it cannot completely solve the problem based on cluster priority.
ClusterSecret
クラスタ数 = Shard数
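Pinning a cluster to a shard looks roughly like this (the Secret name, server URL, and shard number are placeholders; the shard field in a cluster Secret statically assigns that cluster to a shard):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prd-eks
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prd-eks
  server: https://prd-eks.example.com   # placeholder URL
  shard: "2"                            # statically assign to shard 2
```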
In addition, the following disadvantage of sharding must be mentioned:

- If a shard is stopped, its tasks are not handed over to other shards.

Monitoring the Pods for each shard is therefore very important, whether the controller runs as a StatefulSet or as a Deployment.
Improvement results
Let's look at some key core metrics of these improvements:




Before the improvements

Before implementing the improvement measures, we analyzed the current situation while collecting metrics. CPU usage was consistently at 1.8 cores or higher, and although memory was amply allocated, usage was climbing toward 4 GB. The WorkQueue depth stayed above 100, peaked at 350, and approached that peak every two hours. We concluded that the WorkQueue congestion was caused by the long average WorkQueue processing time.
First step: enabling sharding
When Legacy Shard was enabled in v2.7, the initial switch required a Controller restart, which caused the WorkQueue depth to temporarily rise to 500, but then gradually decreased as the Controller replicas started up.
As a result, the WorkQueue processing time was reduced by more than 60%, but because multiple clusters were coexisting on the same shard, there was a large fluctuation in processing time. We confirmed that the processing time tended to increase when multiple clusters started processing at the same time, and to decrease when there was no simultaneous processing.
Second step: switching to Round-robin

Next, we switched the sharding algorithm to Round-robin. Around the same time we observed repository i/o timeout errors; we believe the network environment was also a factor, so we increased the number of Git request retries.
Sharding experiments and applying the optimal solution

We then experimented with Consistent-hashing shards before settling on the per-cluster assignment described above. Applications that had been stuck in an Unknown state recovered; even in an application with more than 2,000 resources, sync completed within one minute from start to finish, and all UI errors were resolved.
In terms of resource usage of each component, CPU usage was stable at an average of 900m and memory at 700 MB, and overall performance remained stable even when a large number of applications were reconciled at the same time.
Other tweaks
The following adjustments are not reflected in the graph analysis above because we did not set up a controlled experiment, but we list them here because they helped us eliminate some errors:
- Extending the Reconcile Period
Reconciling every 3 minutes was not required for Product A, so the reconcile period was extended to 5 minutes. To further reduce load when multiple applications reconcile at the same time, a jitter setting was introduced, so reconciles occur at random intervals between 5m and 5m + 1m.
- Dealing with Helm/Kustomize and Monorepos
In Product A, a specific repository manages the manifests of product applications, and another repository manages the manifests of applications managed by operators. Since Kustomize and Helm are used extensively in each repository, we extended the timeout of the Repo Server to allow Helm to run in parallel.
- Adjusting ARGOCD_K8S_CLIENT_QPS/BURST
We doubled these values for the Controllers, Repo Servers, and API Servers.
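The reconcile-interval and jitter tweaks above map to two argocd-cm keys; a sketch mirroring the 5m + 1m setup described (check your version, as jitter support is relatively recent):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "300s"        # reconcile every 5 minutes
  timeout.reconciliation.jitter: "60s"  # plus up to 1 minute of random jitter
```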
UI Performance
Comparing measurements from before the improvement efforts with those after the improvements and the manual sharding operations, we obtained the following results:
| | Before improvement (FCP, LCP) | After improvement (FCP, LCP) |
| --- | --- | --- |
| Application List | 712 ms, 1.8 s | 562 ms, 1.48 s |
| Application details (800+ resources) | 820 ms, 950 ms | 778 ms, 848 ms |
| Application details (2000+ resources) | 645 ms, 2.45 s | 600 ms, 1.12 s |
Overall, we saw improvements ranging from 11% in FCP to up to 27% in LCP, with some specific pages seeing LCP improvements of over 50%.
We were only able to measure the before-improvement data once, between 10:00 a.m. and noon, a time period when multiple developers use the system at the same time, so we cannot provide precise figures, but the performance improvement was clear.
However, in some applications with 2,000+ resources, delays in screen rendering continue to occur, and the issue of the entire screen freezing, especially when opening resource details with a large amount of information, has not yet been resolved. The load appears to center on the api/v1/stream/applications/<app_name>/resource-tree endpoint, which streams the application's entire resource tree.
Many performance issues with the ArgoCD UI have also been reported in GitHub Issues, particularly degradation when the number of resources is large.
The most promising solution at the moment is server-side paging, which is on the roadmap but not yet implemented.
Unmanaged Resource Problems
There are two main methods for resource tracking in ArgoCD: Annotation and Label.

With the default Label method, a tracking label is applied to every resource ArgoCD manages, and the same label is propagated to the child resources those resources generate, so the children are also treated as managed resources of the ArgoCD Application. Product A's setup, which uses KubeVela, was presented at CNDT 2021; this article is a follow-up to that story.
For example, in Product A, there is an ArgoCD application that manages more than 70 KubeVela applications, and the total number of resources generated by KubeVela reaches more than 2500. In this case, there was a significant delay in the ArgoCD UI rendering speed and the Controller reconcile process, resulting in a noticeable degradation in performance.
This resource management issue has not yet been resolved in Product A and continues to be a challenge.
Conclusion
Thank you for reading this far.
ArgoCD is easy to operate at the beginning, but as the number of applications increases, the number of things to be careful about also increases. Fortunately, thanks to the active participation of the ArgoCD community, many performance-related parameters have been added in recent years. If there is an opportunity to contribute to ArgoCD in the future, I would like to try it.
SRG is looking for people to work with us. If you are interested, please contact us here.