Multi-tenant design and reflections on Ameba Platform

2024/12/12 13:30 / 2024/12/16 20:49

I am @ishikawa_kumo of the Service Reliability Group (SRG) in the Media Headquarters.

#SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.

This article summarizes the design for multi-tenant support carried out on Ameba Platform from 2023 to 2024 and a review of it.

Background and Issues

Ameba Platform is the shared infrastructure (mainly EKS) for AmebaBlog and its peripheral services. It was launched around 2020 to unify the development and deployment flow and to simplify the technology stack. The central goal of the project was to consolidate many services onto a more efficient and manageable infrastructure.

The core services were migrated successfully in the early stages of the project, but authentication services and other services with high security levels could not be moved to Ameba Platform because of security issues with EKS and related components. These services had to stay on the existing infrastructure or run in a separate EKS environment.

Reasons why migration was not possible

Istio Authorization Policy

The challenge of integrating authentication and authorization systems

The ideal design would have been to use our in-house authentication and authorization infrastructure to fully integrate Kubernetes RBAC, AWS IAM, and all developer tools and monitoring and operations tools, but we did not have the leeway to do so in 2020-2021, when we were in the early stages of platform development.

Istio and EKS compatibility issues

Security Groups for Pods (SGP)

Time to rethink the design

In 2023, about three years after the start of the platformization, we had more human resources and more technical options, so we had the opportunity to fundamentally reconsider the multi-tenant design and restart the migration process. First, the stability of the VPC CNI and SGP operations was proven, making it possible to use SGP. In addition, VPC CNI began native support for NetworkPolicy from 2022, minimizing vendor dependency. In 2023, I joined this project immediately after joining CyberAgent, and designed authentication integration and multi-tenancy on AWS and EKS.

Clarification of absolute requirements

The following requirements were set as non-negotiable for the redesign:

Complete communication isolation according to security level

Pod ↔ Pod, Pod ↔ AWS resource

Centralized authentication of all communications through a common authentication infrastructure

AWS, EKS, Datadog, ArgoCD, GitHub Teams

Strict access control for AWS and Kubernetes resources

Utilizing RBAC, ABAC, etc.

Design Policy

In addition, the following principles were used as design guidelines:

Minimize dependency on specific vendor products

Aim to achieve this using Kubernetes and AWS default functions as much as possible

Ensuring simplicity and maintainability of authentication and authorization processes

Security Isolation Perspective

Security Level

AWS services are independent of one another and sit on an equal footing; none is inherently more trusted than another. To separate them by security level, we use IAM ABAC to identify the security level of each service and control communication based on it. Resource tags are one of the most practical ways to identify the security level of a service.

On Ameba Platform, we took the characteristics of microservices into account and categorized services as follows:

Protected services: Services that have high security requirements and require strict management

Non-Protected services: Services with relatively low security requirements

The following principles have been established for communication control:

Services with high security level (Protected)

Inbound communication is strictly restricted

Outbound communication is relatively free

Services with low security level (Non-Protected)

Relatively loose restrictions for both inbound and outbound

Exception handling

In a traditional multi-tenant model, it is common to set strict restrictions on each tenant and restrict access to only their own resources. However, in real-world operations, there are more complex requirements. For example, a team managing authentication services must deal with services with different security levels on a daily basis.

Even services with high security levels sometimes need to allow exceptional inbound communication, for example when some of their APIs are published. When we allow such exceptional communication, we adjust the security level of the affected targets.

The following communication restrictions have been set on Ameba Platform.

Non-Protected services cannot access Protected services

Protected services can access Non-Protected services

Protected services that expose specific endpoints are demoted to Non-Protected services.
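
The three restrictions above can be read as a small decision function. The following is a minimal sketch rather than the actual implementation: the `Service` class and the function names are hypothetical, and two points are my assumptions because the article does not state them, namely that Protected-to-Protected traffic is allowed and that a demoted service is also treated as Non-Protected when it acts as a caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical model of a service and its security attributes."""
    name: str
    protected: bool = False  # carries the protected attribute
    exposed: bool = False    # exposes specific endpoints


def effective_level(svc: Service) -> str:
    """A Protected service that exposes endpoints is demoted to Non-Protected."""
    if svc.protected and not svc.exposed:
        return "protected"
    return "non-protected"


def can_access(src: Service, dst: Service) -> bool:
    """Return True if src may open a connection to dst."""
    # Anyone may reach a (possibly demoted) Non-Protected service.
    if effective_level(dst) == "non-protected":
        return True
    # dst is Protected: Non-Protected callers are rejected.
    # Allowing Protected callers here is an assumption.
    return effective_level(src) == "protected"


blog = Service("blog")
auth = Service("auth", protected=True)
auth_api = Service("auth-api", protected=True, exposed=True)

print(can_access(blog, auth))      # Non-Protected to Protected
print(can_access(auth, blog))      # Protected to Non-Protected
print(can_access(blog, auth_api))  # Non-Protected to a demoted service
```

Modeling demotion as a derived level, instead of rewriting the `protected` flag, keeps the original classification visible when the exception is later removed.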

Approach and details

The specific implementation approach for the multi-tenant design revolved around four key areas:

1. Authentication and Authorization Integration Strategy

Unification of authentication infrastructure

Ameba Platform adopted the following integrated approach to achieve centralized management of authentication and authorization:

AWS, Datadog, and GitHub authentication: Leveraging our in-house SAML infrastructure

Node SSH access: Using the company's LDAP infrastructure

ArgoCD: Using OAuth2 and OIDC via GitHub Teams

Authentication integration was one of the most complex aspects of this project because of the limitations of our in-house authentication infrastructure. Some of these integrations could have been centralized with OIDC, but our in-house authentication infrastructure lacks OIDC functionality, so we had to adopt a variety of methods.

In the case of ArgoCD in particular, security concerns over SAML in Dex prevented us from integrating directly with the in-house authentication infrastructure, so we integrated OIDC via GitHub Teams. Since GitHub Teams is already integrated with SAML, no separate inventory of users is needed.

Please see our previous article regarding issues with ArgoCD SAML integration.

ArgoCD SSO Integration SAML2.0 Edition - CyberAgent SRG #ca_srg

https://ca-srg.dev/b812f5c7612746f88f5200f5c5b566f0#block-1a23ef370409441797ce950632da44b8

ABAC: Role

developer, admin

<product>-<tenant>-<role>

ameba-A-developer

ameba-A-secure

We have also created a system that makes it possible to add other roles and corresponding attributes depending on operational circumstances.
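
As an illustration of the `<product>-<tenant>-<role>` format, here is a minimal sketch of a parser with an extendable role set. The names `RoleName`, `KNOWN_ROLES`, and `parse_role_name` are hypothetical, and reading `secure` in `ameba-A-secure` as an added role is my interpretation of the example, not something the article states. The sketch also assumes that none of the three parts contains a hyphen.

```python
from typing import NamedTuple

# Roles named in the article; callers can pass an extended set.
KNOWN_ROLES = frozenset({"developer", "admin"})


class RoleName(NamedTuple):
    product: str
    tenant: str
    role: str


def parse_role_name(name: str, known_roles: frozenset = KNOWN_ROLES) -> RoleName:
    """Split '<product>-<tenant>-<role>' into its three parts."""
    parts = name.split("-")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <product>-<tenant>-<role>, got {name!r}")
    parsed = RoleName(*parts)
    if parsed.role not in known_roles:
        raise ValueError(f"unknown role {parsed.role!r} in {name!r}")
    return parsed


print(parse_role_name("ameba-A-developer"))
# Adding a role is a matter of extending the allowed set.
print(parse_role_name("ameba-A-secure", KNOWN_ROLES | {"secure"}))
```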

ABAC: Policy

In order to achieve advanced access control, we have introduced Resource Tag-based attribute management.

ameba.jp/protected=true

ameba.jp/sensitive=true

ameba.jp/exposed=true

StringNotEquals

Use NotAction to distinguish between Admin and Developer

StringNotEquals Condition

StringEquals Condition
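
To make the combination of Resource Tags, `NotAction`, and the `StringNotEquals` / `StringEquals` conditions concrete, here is a minimal sketch that builds a policy document as a Python dictionary. It is not the policy used on Ameba Platform: the function name, the `Sid` values, the example action list, and the use of a principal tag with the same key are all assumptions for illustration. Note that `StringNotEquals` also matches resources that carry no tag at all.

```python
import json

PROTECTED_TAG = "ameba.jp/protected"  # tag key taken from the article


def developer_policy(admin_only_actions: list) -> dict:
    """Build a hypothetical ABAC policy document for a developer role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Everything except admin-only actions, limited to
                # resources that are NOT tagged as protected.
                "Sid": "AllowNonProtected",
                "Effect": "Allow",
                "NotAction": admin_only_actions,
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        f"aws:ResourceTag/{PROTECTED_TAG}": "true"
                    }
                },
            },
            {
                # Protected resources only for principals that carry the
                # matching tag (the principal tag is an assumption).
                "Sid": "AllowProtectedForTaggedPrincipals",
                "Effect": "Allow",
                "NotAction": admin_only_actions,
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        f"aws:ResourceTag/{PROTECTED_TAG}": "true",
                        f"aws:PrincipalTag/{PROTECTED_TAG}": "true",
                    }
                },
            },
        ],
    }


policy = developer_policy(["iam:*", "organizations:*"])
print(json.dumps(policy, indent=2))
```

Listing the admin-only actions under `NotAction` keeps the developer policy short, at the cost of granting any newly released action by default, so the list needs regular review.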

Although ABAC can control most AWS services, there are some services that cannot be controlled with Resource Tags. In such cases, you will need to handle them individually with the corresponding Condition.

For example, the following services and APIs:

More information can be found in the Service Authorization documentation.

Actions, resources, and condition keys for AWS services - Service Authorization Reference

https://docs.aws.amazon.com/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html

EKS RBAC

developer/admin

ClusterRoleBinding

RoleBinding

2. Implementing Network Security

Pod ↔ Pod communication

ConfigurationValues

PodSelector

When applying a tagging strategy similar to IAM ABAC, please note the following:

ameba.jp/protected=true

An Expose Tag is required when exposing a Pod in a Protected Namespace to the outside world.

Hierarchical Namespace

namespaceSelector

ameba.jp/exposed: "true"

Also, with the inheritance feature of Hierarchical Namespace, the policy no longer has to be created in each child namespace.
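
As a rough illustration of label-based isolation with a standard Kubernetes NetworkPolicy, the sketch below builds a manifest that only admits traffic from namespaces labeled `ameba.jp/protected=true`. The policy name, the namespace name, and the assumption that the label is attached to namespaces are mine; the handling of the `ameba.jp/exposed` label is deliberately left out because the article's exact rule is not recoverable from the text.

```python
import json


def protected_ingress_policy(namespace: str) -> dict:
    """NetworkPolicy admitting ingress only from protected namespaces."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-protected-only", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every Pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {"ameba.jp/protected": "true"}
                            }
                        }
                    ]
                }
            ],
        },
    }


# kubectl accepts JSON manifests, so the output can be applied as is.
print(json.dumps(protected_ingress_policy("auth"), indent=2))
```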

Communication between Pods and AWS resources

Security Groups for Pods (SGP)

There are two steps to using SGP.

Change the following settings in the VPC CNI:

Use the SGP custom resource
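
The two steps can be sketched as follows. The concrete values are not the ones used on Ameba Platform: the sketch assumes the `ENABLE_POD_ENI` environment variable on the `aws-node` DaemonSet and the `SecurityGroupPolicy` custom resource as described in the AWS documentation, and the resource name, namespace, labels, and security group ID are hypothetical.

```python
import json

# Step 1 (assumption): environment variable set on the aws-node DaemonSet.
VPC_CNI_ENV = {"ENABLE_POD_ENI": "true"}


def security_group_policy(name: str, namespace: str,
                          pod_labels: dict, security_group_ids: list) -> dict:
    """Step 2: SecurityGroupPolicy attaching security groups to matching Pods."""
    return {
        "apiVersion": "vpcresources.k8s.aws/v1beta1",
        "kind": "SecurityGroupPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "securityGroups": {"groupIds": security_group_ids},
        },
    }


sgp = security_group_policy(
    name="auth-db-access",               # hypothetical
    namespace="auth",                    # hypothetical
    pod_labels={"app": "auth"},          # hypothetical
    security_group_ids=["sg-0123456789abcdef0"],  # placeholder ID
)
print(json.dumps(sgp, indent=2))
```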

SGP carries several risks.

There is a limit to the number of Pods that can be applied.

Branch ENI

Trunk ENI

Branch ENI

Pod startup speed will be slower.

Branch ENI

Potential for conflict with other network vendors

This has now been resolved, but in the past there was reportedly a conflict with Istio (although this is unconfirmed). AWS Support recommends IAM authentication instead, so SGP should be treated as a last resort.

If you are interested in the details of SGP, please refer to our previous article.

Deep Dive into VPC CNI: A Deep Analysis of IPAMD and Security Groups For Pods - CyberAgent SRG #ca_srg

https://ca-srg.dev/922372aee0364004830578a09799d858

3. Protecting shared resources, volumes and backups

Shared Resources

For shared resources such as ECR and S3, which are managed centrally in a Shared account, we have implemented access control using ResourceTag in the Shared account. For services where it is difficult to control Resource Tags (such as S3), we also use an identification method using the prefix of the resource name.
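
For the prefix-based identification, a minimal sketch of an IAM statement scoped by bucket name prefix might look like this. The `<product>-<tenant>-` prefix scheme is hypothetical (borrowed from the role naming format), as are the function name and the chosen actions.

```python
import json


def s3_prefix_statement(product: str, tenant: str) -> dict:
    """Allow access only to buckets whose name starts with the tenant prefix."""
    bucket_pattern = f"arn:aws:s3:::{product}-{tenant}-*"
    return {
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
        # Bucket-level actions match the bucket ARN,
        # object-level actions match the "/*" ARN.
        "Resource": [bucket_pattern, f"{bucket_pattern}/*"],
    }


print(json.dumps(s3_prefix_statement("ameba", "A"), indent=2))
```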

Storage Tier

All EBS Volume operations can be controlled by Resource Tags, with one exception when used with EKS:

Kubernetes PersistentVolume (PV)

Backup

All AWS Backup APIs such as create/copy are controlled by Resource Tags.

4. Monitoring and APM Security

When integrating with monitoring tools, especially Datadog, we tried to integrate with the authentication infrastructure, but there were issues with APM's permission control.

Access to APM as a whole can be restricted, but we found that fine-grained permission control is difficult. At the granularity we wanted, the only options were to block APM for everyone without a specific permission or to let everyone see it.

Therefore, we adopted an approach that involves masking sensitive data before it enters APM.
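
The masking idea can be sketched in plain Python, independent of any tracing library. The key list, the mask string, and the function names below are assumptions for illustration, and the article does not say where in the pipeline the masking is applied.

```python
import re

# Hypothetical list of tag names that should never reach APM in clear text.
SENSITIVE_KEY = re.compile(r"password|token|secret|authorization|email", re.IGNORECASE)
SENSITIVE_QUERY = re.compile(r"(?i)\b(token|password)=[^&\s]+")
MASK = "***"


def mask_tags(tags: dict) -> dict:
    """Return a copy of span tags with sensitive values replaced by MASK."""
    return {key: (MASK if SENSITIVE_KEY.search(key) else value)
            for key, value in tags.items()}


def mask_url(url: str) -> str:
    """Mask sensitive query-string values while keeping the parameter names."""
    return SENSITIVE_QUERY.sub(r"\1=" + MASK, url)


span_tags = {"http.method": "POST", "user.email": "a@example.com"}
print(mask_tags(span_tags))
print(mask_url("https://example.com/cb?token=abc&x=1"))
```

Masking by tag name is simple but only as good as the key list, so values that can appear under arbitrary names still need to be handled at the source.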

Looking back

It has been about half a year since the entire Ameba Platform environment was updated, but due to a lack of human resources, the migration has not yet begun. Based on the trials so far, I will summarize what I felt and what I thought after looking at examples from other companies.

Fine-grained control is not available

With only the two roles, developer and admin, we cannot define detailed permissions for users who should only access certain services and resources. I wonder whether it is acceptable to give them such a powerful role.

Since there is a one-to-one relationship between IAM roles and roles in the company's authentication infrastructure, it is practically difficult to increase the number of roles as needed. I am still thinking about what to do. If you have any good ideas, please let me know.

The operational reality of each tenant is more complicated

<product>-<tenant>-<role>

For example, the authentication infrastructure roles used by some tenants served multiple purposes, with each tenant handling both member management and collaboration management through them. Since such roles could not be integrated into Ameba Platform, the result would likely be an incomplete multi-tenant system.

Furthermore, our authentication infrastructure has a reference limit between roles, so even if we wanted to integrate member management into the Ameba Platform, it is unclear at this time to what extent this would be possible.

Network Policy is very useful

Cloudflare Tunnel

Conclusion

I wrote this article while recalling the whole process of multi-tenant support at Ameba. Looking back, my memory is hazy in many places, and the carefully written documentation at the time saved me many times.

This article is just one example within CyberAgent, but we hope it will be of use to you.

SRG is looking for people to work with us. If you are interested, please contact us here.

Recruitment information - CyberAgent SRG #ca_srg

https://ca-srg.dev/careers