A discussion on "Is the infrastructure sector really necessary?"

2024/7/31 19:18 / 2024/9/20 8:53

This is @ishikawa_kumo from the Service Reliability Group (SRG) of the Media Headquarters.

#SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.

This article is a translation of two articles that were popular within the Chinese infrastructure engineer community.

While reading Chinese IT blogs, I happened upon an article arguing that "infrastructure departments are unnecessary" and another article rebutting it. Some of the arguments are weakly supported, but as someone who works in an infrastructure-like department, I felt that they offer a view of the trends and issues in the Chinese IT industry, as well as some food for thought for infrastructure engineers, so I decided to translate them.

Question raised: Is the infrastructure sector really necessary?

Problem 1: No core technology

Problem 2: Not providing business value

Problem 3: Lagging behind in technological ideas

Problem 4: Intentional obstruction to DevOps

Counterargument: The infrastructure sector never dies, it just fades away

What is the value of the infrastructure sector?

No core technology in the infrastructure sector?

How does the infrastructure function deliver business value?

Delay in technological ideas

Is the infrastructure sector a bottleneck in technological progress?

Fate of Vanishing

Conclusion

Issue raised: Is an infrastructure department really necessary?

Author: Machi

He has worked as a PM for DB engine development and Cloud Computing product development at Tencent and Huawei, and is currently a senior infrastructure engineer at PayPal Sweden.

LinkedIn: https://www.linkedin.com/in/ma-chi-a1ab3a1/

Original article URL: https://mp.weixin.qq.com/s/yalmoDbY75_Pz9PCzpjiPQ

In the past 20 years, Chinese Internet companies have made great achievements. The infrastructure division (or technology platform division, operation and development division, architecture platform division) has made great contributions as the backbone of the technology foundation of Internet companies. However, with technological advances, the old experience of these divisions is increasingly inapplicable in the cloud era, and they are generally unable to deal with new problems. The combination of these two factors raises the question: is the infrastructure division still worthwhile?

The value of the infrastructure division is to provide technology to the business division.

The Infrastructure division was established with the goal of consolidating the company's infrastructure and architecture talent and providing technical support to the business divisions. To achieve this goal, the Infrastructure division must meet two conditions:

Possess excellent technology

Lead the business units and help them put that excellent technology to use

Unfortunately, over the past few years the infrastructure sector has been scoring increasingly poorly on these two fronts. Below I will list some common issues in the infrastructure sector and invite readers to discuss them.

Problem 1: No core technology

In their annual presentations, infrastructure departments customarily emphasize that they have built some platform that provides general-purpose functions such as big data management, operations, observability, and security hardening to the business departments. Some infrastructure technical experts ridicule business developers as "CRUD programmers" and take pride in their own technical depth. Lift the lid, however, and you will find that these experts are really just craftsmen who assemble OSS. Their main job is to find some OSS on GitHub, run a few benchmarks, and deploy the chosen software in their data center. In this process, the sensible ones tweak a few parameters of the community version and deploy it as is. The smart ones modify it so that it is hard for anyone else to take over. The even smarter ones write an imitation of the chosen OSS so that nobody else wants to take it over. I call these people "parameter tuning craftsmen", "magic modification craftsmen", and "imitation craftsmen", respectively.

Parameter tuning craftsman

Although parameter tuning craftsmen are often looked down upon, they are actually the most reliable craftsmen. At least, when a craftsman leaves, the employer can find someone to take over. Let's take a look at Tencent Cloud's CLB documentation.

Layer 7 load balancing is primarily based on STGW. STGW is a Layer 7 load balancing service developed by Tencent on top of Nginx that supports large-scale concurrent connections. It carries the traffic of many of Tencent's internal Layer 7 services (Tencent News, Tencent Games, WeChat, etc.).

This CLB is based on Nginx, so what exactly did they "develop"? Perhaps the installation script?

Magic modification craftsman

AliSQL

💡

Translator's Note: AliSQL is a database engine based on MySQL 5.6 developed by Alibaba Cloud. The last release was in May 2018. The AliSQL GitHub repository is currently unmaintained, and PRs from the community have been completely neglected. As a result, there is a lot of criticism in the GitHub issues (https://github.com/alibaba/AliSQL/issues/112). Among Chinese IT engineers, Alibaba Cloud's OSS projects are sometimes ridiculed as "Open Source for KPIs".

Imitation craftsman

For example, NWS, which is used in Tencent Cloud's CDN.

All of Tencent Cloud's CDN edge nodes use NWS to provide optimal service performance to customers. NWS is a high-performance HTTP server independently developed by Tencent, and is said to offer clear improvements in both functionality and performance over general-purpose servers such as Nginx.

In fact, NWS is nothing more than an imitation of Nginx, and the performance gains were spurious and only due to the fact that the Nginx it was compared to had suboptimal parameters.

💡

Translator's note: There is currently no information available to support the benchmark results for NWS and Nginx.

Problem 2: Not providing business value

In the past few years, big data has been the field best known for parameter tuning, magic modification, and imitation. At first, everyone used Hadoop. Then the infrastructure department compared Impala, Presto, Spark Streaming, and Storm, made its technology selection, and claimed to the CTO that it fully understood big data technology. Next, it started building a big data platform. After securing a budget and working on it for a year, all it had to show was a Web UI, which it named an "integrated big data management platform." Larger teams even built a user management platform, which often did not even support SSO.

A few years later, as the Hadoop ecosystem fell out of favor in the West, these craftsmen grabbed ClickHouse as a lifeline, and the same routine began again.

Over the past decade, at least hundreds of infrastructure departments (or big data platform departments) across China have been doing this. When these craftsmen get together, they start childish competitions like, "You need 600 machines, but I only need 300," or, "It takes me an hour to process 5TB of data, but you take 70 minutes, so I'm 14% faster." This is like two middle school students competing with each other, saying, "You can only pee 2 meters, but I can pee 3 meters."

In reality, the business doesn't care about how many machines you have. They really care about whether the big data platform can create business value. For example,

Can the big data platform automatically ingest new data sources?

Can it be easily integrated with BI tools?

Is granular access control and access auditing possible?

Can you accurately bill business units?

Can you comply with personal information protection requirements?

However, due to the operational culture of the infrastructure department, they rarely pay attention to these issues. They focus on tweaking the performance of machines in their comfort zone every day. Sometimes they report strange results like "Three engineers spent six months saving the company 50,000 yen, the cost of half a machine."

Athena

Problem 3: Lagging behind in technological ideas

Over the past decade, the software engineering industry has gone through a series of slow but steady technological innovations, and today's software industry is very different from that of 15 years ago. However, because these innovations have not had a dramatic impact like the iPhone or ChatGPT, many practitioners have not noticed the progress, including decision makers and engineers in architecture platform departments. Their outdated thinking has caused their technology to fall behind; they have lost their technical advantage over the business departments, and in some cases the positions have even been reversed.

Clinging to already-solved scalability problems

W11

To be honest, 20 years ago, LAMP configurations were common and many teams were struggling with the C10K problem. During that time, BAT (Baidu, Alibaba, Tencent) companies were exploring large-scale service engineering practices, which was valuable in supporting their business growth.

But 20 years later, scalability is a solved problem, and the fundamental approach is largely the same everywhere:

Use microservices, each with its own independent resources

Keep your web servers stateless and scale them out/in automatically

Use load balancing to serve external requests

Use containers to avoid manual deployment of binaries

Use object storage that is separate from the compute nodes, rather than using file systems that are hard to scale

Ensure that each Pod is replaceable

Use an auto-scaling distributed database and add cache when needed

Use message queues to bridge the processing power gap between different systems

Infrastructure as Code

Build a CI/CD pipeline that includes code, configuration files, and secrets to ensure code quality.

Provide a standard observability dashboard for each microservice

Partition users at the business layer as needed
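As an illustration of the last item in the list above, partitioning users at the business layer can be as simple as mapping each user ID to a shard with a stable hash. The following is only a minimal sketch: the shard count and the function name `shard_for_user` are assumptions made for this example and do not come from the original article.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index in the range [0, num_shards).

    A cryptographic hash is used instead of Python's built-in hash(),
    because hash() on strings is randomized per process and could send
    the same user to a different shard after a restart.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, "->", shard_for_user(uid))
```

Note that with plain modulo hashing, changing the number of shards remaps most users, which is why consistent hashing is a common alternative when shards are added or removed frequently.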

Basically, these best practices are enough for most workloads. There are tools for each step, and any engineer can apply them in their own team. In contrast, when infrastructure experts talk about high concurrency, they tend to dwell on the finer points of threads and processes, or use vague terms like "active-active configurations." Press them a little further and they get angry and ask, "Have you ever handled a Double Eleven?", "Do you know how many machines I manage?", or "Do you have a billion users?" To be honest, all of this talk is basically a ritual performed by believers.

Clinging to ClickOps

(Figure in the original article: LY.COM's HBase cluster application process)

This kind of ClickOps was acceptable 15 years ago, but the industry has now moved to the concept of IaC, where engineers should communicate directly with resource providers using IaC code, removing the operational middleman.
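To make the IaC idea concrete, here is a toy sketch of the declarative model that IaC tools are generally built around: the engineer describes the desired resources as data, and a program computes the steps needed to reach that state. This is not the API of any real tool; the resource names and the `plan` function are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical resource model: resource name -> configuration dict.
Resources = Dict[str, dict]


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """Return the (action, resource_name) steps that turn `actual` into `desired`."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif desired[name] != actual[name]:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


if __name__ == "__main__":
    # Desired state, as it might be declared in a version-controlled file.
    desired = {
        "hbase-cluster-a": {"nodes": 5},
        "bucket-logs": {"versioning": True},
    }
    # Actual state, as it might be reported by the resource provider.
    actual = {
        "hbase-cluster-a": {"nodes": 3},
        "bucket-tmp": {"versioning": False},
    }
    for action, name in plan(desired, actual):
        print(action, name)
```

Because the desired state is plain data, it can be reviewed and versioned like any other code, which is what makes a ticket-and-operator workflow unnecessary.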

💡

Translator's notes:

LY.COM is a Chinese travel booking site. LY.COM's Platform engineer responded by saying: The process above has already been significantly improved, and now the HBase cluster is created automatically by the Kubernetes Operator after workflow approval.

LY.COM: In fact, the backend of LY.COM's HBase cluster service application process integrates the configuration parameters of the application workflow with the HBase cluster configuration template to automatically launch a cluster instance using Kubernetes. This automation process works very smoothly, and there is no longer a situation where an operations engineer receives the ticket and runs a secret script to create a cluster.

Problem 4: Intentional obstruction to DevOps

In addition to technological delays, infrastructure departments sometimes intentionally hinder the efficiency of their teams for their own benefit (budget and survival). For example, some departments exclusively manage all infrastructure resources under various pretexts and make developers submit applications every time they need an object storage bucket. If a developer needs a database, they also need to submit an application, and even updating a single line of SQL must wait for DBA approval. In infrastructure-led CI/CD pipelines, only the CI part is often functioning and the CD part is intentionally separated. This is done under the pretext of "ensuring the stability of the production environment," but in reality, they are afraid that if they open up CD, the business development teams will realize that they can do high-quality work efficiently without intermediate procedures, and their department will become unnecessary.

Vendor skepticism

Even though they belong to the same company, the relationship between the business department and the infrastructure department is really that of user and vendor. Naturally, vendors tend to shut out other vendors. Even when external vendors offer excellent observability tools, the infrastructure department will not use them and will instead build its own hard-to-use ELK stack. Even when a reliable RDS is available externally, the infrastructure department will try to run MySQL by itself. Before Lark and DingTalk became popular, at least 200 of China's Top 500 companies had teams developing internal IM tools. A great deal of human resources was wasted reinventing a poor wheel. These IM tools, supposedly tailored to the special needs of the company, are hard to use and weak on security, and the teams behind them are often disbanded while the tools are still in use.

💡

Translator's notes:

Lark and DingTalk are Chinese enterprise IM/Chat tools.

Furthermore, when gaming companies develop their own HR systems, food delivery companies develop their own code management tools, and telecommunications equipment companies develop their own travel management systems, they are all using their employers' money to expand the reach of their departments because they have free time on their hands.

Distorted relationships with business units

Many infrastructure departments claim that their value is that "because we are in-house people, we understand the needs of the business well." At first glance, this sounds plausible. In reality, however, for many Internet companies technology is not a core competency, general-purpose software is sufficient, and there are not many special needs. Conversely, if a business department has special needs that are not commonly supported across the industry, there may be a problem with its software selection. As a typical example, if you use MySQL with Zabbix to store time series data, many requirements will be placed on MySQL, but these requirements should not be met, because MySQL was not designed for processing time series data in the first place. Another typical example is the need for fixed IPs in Kubernetes, which is often seen in China. In fact, businesses that need a fixed IP should not use Kubernetes, and if they really want to use containers, they should deploy them directly on virtual machines. However, because infrastructure departments pandered to this need, the request spread throughout the country and even reached the Cilium project. The Cilium project did not respond to the request, and the issue has been left unattended for three years (https://github.com/cilium/cilium/issues/17026).

💡

Translator's note: After this Chinese article was published, we received a response from a Chinese committer right away.

The operations team's conservative mindset

Many companies' infrastructure departments grew out of the operations team and place a high emphasis on reliability. While this is a strength, it also creates a conservative culture. A common saying among operations veterans is "don't touch what's working." If you look at the technology stacks of many leading IT companies, you will see a lot of old software that is no longer being maintained, such as MySQL 5.7, CentOS, Python 2.X, and GCC 4.9. Such software poses a high risk in terms of compatibility and security, because patches are not provided even when security holes are discovered. Although the infrastructure department should be responsible for updating such old software, it often takes the most passive stance on technology upgrades because it prioritizes availability.

Photos cited in the original article

Counterargument: The infrastructure sector never dies, it just fades away

Author: Leo Li

Former Chief Architect at DiDi and CTO at MeiQia, currently Co-Founder and CEO of ClapDB.com

LinkedIn: https://www.linkedin.com/in/leo-li-50474338/

Original article URL: https://mp.weixin.qq.com/s/E1fEdwoJ4PQ5YH3f52GqNw

💡

Translator's note: ClapDB is a startup company that offers a serverless DB SaaS product for data analysis.

Yesterday, PayPal's Mr. Ma wrote an article titled "Is the Infrastructure Division Really Necessary?" in which he questioned the significance of the infrastructure division. As someone who has been involved in infrastructure for many years, I would like to respond to the issues raised in that article from a relatively objective standpoint. With this article, I would also like to pay tribute to the veterans of the infrastructure division.

What is the value of the infrastructure sector?

As Mr. Ma stated, the infrastructure division exists to provide technology to the business division, and to do so, the following two prerequisites are necessary:

Possess excellent technology

Lead the business units and help them put that excellent technology to use

The "excellence" of a technology depends on the context

However, there is no single standard for "excellence." What is excellent technology? Technology always involves trade-offs. For example, even if a certain technology delivers the best performance, it is meaningless for many companies if its functions are too limited or its costs are too high. The standard of excellence is different in each business context. Companies are not all competing on the same field, so the infrastructure departments that support their businesses should not share one uniform set of goals.

Take the goal of maximum concurrency for example. Salesforce, a world-famous name in the SaaS industry, has been going against the trend of "de-IOE" (removing IBM, Oracle, EMC storage) for years and using Oracle as its underlying storage. Salesforce also has low concurrency, but that doesn't mean that Salesforce's technology is not good. It's just that the technology is appropriate in the Salesforce domain and doesn't fit the requirements of other domains. There's no need for cats to race penguins; penguins are good swimmers.

"Leading" business divisions?

The infrastructure department exists as a cross-sectional department and plays a role in reducing overlapping and incorrect development between departments. In fact, in the Internet era and the mobile Internet era, Internet companies set "growth" as their most important indicator, and in the process hired a large number of inexperienced programmers and repeated trial and error in new business areas. If the cross-sectional department did not "guide" or "standardize", the business could grow too "barbaric".

No core technology in the infrastructure sector?

Are they really "parameter tuning craftsmen," "magical modification craftsmen," or "imitation craftsmen"?

The article discusses whether the infrastructure sector has core technology and categorizes engineers into "parameter tuning craftsmen," "magic modification craftsmen," and "imitation craftsmen." I completely agree with this point. But what does that matter? Many Chinese Internet companies are modeled on American companies and conduct almost the same business as their American counterparts. Therefore, there is no problem in using almost the same technology as those counterparts. This is the most reliable and least risky technology selection. The original meaning of CTC (Copy To China) is, after all, to cross the river by feeling for the stones, with America as the stepping stones. If you can copy the business model, the tech stack won't be an issue.

💡

Translator's Note: Copy To China: This refers to Chinese companies imitating the business model of a successful foreign company. The degree of imitation varies, from simply offering a directly competitive service to imitating the UI, trademark appearance, and similar-sounding names. While this practice is often criticized in countries outside of China, including Japan, in China it is widely recognized as a model of success by both private companies and the government.

Furthermore, major companies in America's Silicon Valley, such as Twitter (now X), Facebook (now Meta), and Airbnb, have essentially grown on a foundation of open source technology. And when problems arise in the process, they are solved by "parameter tuning" or "modification." However, "imitation" is a sign of the big company disease.

"Imitation" is a symptom of the big business disease

As a company grows, the "measure" of individual promotion and evaluation changes, from contribution to "contribution x difficulty". However, as a company grows and business stabilizes, most contributions become unnoticeable. As a result, employees try to force themselves to increase the difficulty in order to get better evaluations and promotion opportunities. The phenomenon of "imitation" occurs when a company grows and the team's goals diverge from the company's overall goals. Even if a company introduces KPIs and OKRs, this problem cannot be prevented. In reality, "imitation" is often justified for some good reason, and problems that can be solved by imitation are found. It is up to the individual to decide whether to imitate or transcend, but without such a gray area, innovation would not occur. You cannot understand the core of anything until you have tried to imitate it once. They say that the devil is in the details, but you never know which details are important and which are the devil until you actually get your hands dirty. In short, big company disease is not necessarily a bad thing, but rather the resources and profits of large companies give innovation a certain amount of room. However, while investments are inherently risky, it is also true that if you don't invest, you get nothing.

What's wrong with modifying OSS?

OSS generally comes from one of the following three types of sources, listed here as type 1 to type 3:

Software that large companies open source for their own commercial purposes in order to gain market share and beat competitors (Android, Chrome, Kubernetes, VSCode, etc.)

Software that companies open source in order to acquire paying customers (Elasticsearch, Nginx, MySQL, Cassandra, etc.)

Software provided by open source foundations (GNU, Linux, etc.)

In general, type 1 follows its own roadmap and is not very interested in niche needs. Type 2 intentionally weakens enterprise features (otherwise, how would they sell commercial versions?). Type 3 takes a "PRs welcome" stance and does not offer the kind of service that commercial software does. Internet companies around the world therefore encourage their employees to modify OSS (magic modification). Facebook has carried out its own HBase work in-house, and Google has submitted cgroup patches to upstream Linux. Modifying OSS is thus not a sin; the problem is technical immaturity.

💡

Translator's note: The original text contains not only technical immaturity, but also a wide range of nuances such as organizational structure, culture, and perception.

How does the infrastructure function deliver business value?

Infrastructure functions generally operate as horizontal functions and often have no business revenue against which to calculate their own ROI. Instead, they demonstrate their value through more general metrics such as availability, scalability, and business iteration speed. A cat that catches mice is a good cat, even if it does not wear the latest nail polish. In Internet companies, technology serves the business, and the vast majority of companies are not FAANG. It is unfair and unrealistic to ask the security guards of an ordinary company to invent Star Wars gear.

Delay in technological ideas

Copy from Google

Why are we obsessed with scalability?

The infrastructure department is responsible for availability and scalability. Saying anything other than that would seem like neglecting your core business.

Is the infrastructure sector a bottleneck in technological progress?

Proving its value within the organization is a vital task for any team, especially a cross-functional one. Infrastructure functions can only prove their worth when they are held accountable for the problems that arise. So it makes sense that the infrastructure department as a whole is conservative, which is in line with its role and purpose. Infrastructure teams often act as regulators of the business teams' demands and actions, which makes theirs look like a thankless job. However, they still provide a vital constraint within the company.

Organizational inertia

This is the inertia of every organization throughout history. Most organizations never progress or innovate from within; they are driven by external forces. For example, the Soviet Red Army still pinned its hopes on the Cossacks when faced with German tanks. Infrastructure teams are like any other team: their mission is to support the business of the company. If they are achieving that mission, there is little incentive to reinvent themselves. But this is not the fault of the Cossacks (an old ideology).

The barrier to entry for non-tech companies isn't technology

Many Internet companies in the Chinese market are not technology-driven and do not need a constantly evolving infrastructure department. Rather, it is more practical to keep costs down when the business is not growing rapidly.

Why are infrastructure departments excluding outside vendors?

Infrastructure teams are often in-house technology providers, and it is natural for them to exclude external competitors. After all, external vendors take much the same approach as infrastructure teams: they build on open source systems with tweaks, parameter tuning, and even imitation. It is difficult for infrastructure teams to give up their weapons and buy someone else's. Both users and vendors have their own problems.

Fate of Vanishing

Old soldiers never die, they just fade away.

New productive forces always give rise to new production relations. The infrastructure team itself is a microcosm of the Internet companies of the past, which were "trying to achieve growth by using excess resources."

However, every era comes to an end, and the infrastructure sector, like any excess resource in an era of growth, will be gradually scaled back. Internet companies will shift from high inputs to rational inputs with a high ROI focus. They will shift from a desire for autonomy and control to a desire for control at a more reasonable cost.

💡

Translator's Note: "Trying to achieve growth by using excess resources" refers to a growth strategy commonly seen among Chinese Internet companies. With the spread of the Internet, and especially in the mobile Internet era, many companies overinvested resources and packed as many functions as possible into one app, aiming to capture users and rapidly expand market share. This is also meant to prevent users from abandoning the app and switching to other services. For example, even for single-function apps such as weather apps, it is common in China to see all-purpose apps that add news, short videos, food delivery, and even payment functions. This approach is called "barbaric growth": it emphasizes expansion over efficiency and reflects the culture and structure of Chinese Internet companies. (It is a phenomenon that has also been seen in Japan recently.)

Infrastructure teams will move in one of two directions:

Embrace the cloud and reduce labor costs

Embrace the market, bring your products to the market and join the competition

By reevaluating and reassigning the role of infrastructure from an ROI perspective, unnecessary people will be removed from the team. Products of the times will naturally evolve with the times.

Conclusion

These two articles discuss the significance and future prospects of the "infrastructure division" that supports the technological foundations of companies from different perspectives. The reality is that the role that infrastructure divisions play within companies is changing with the evolution of technology. As Ma points out, if you stick to the old ways, you risk being left behind by the times. On the other hand, as Li says, the infrastructure division still has technological value and an important role to play in stably supporting business.

In the future, infrastructure departments and infrastructure engineers will be faced with challenges such as reevaluating their roles and finding ways to balance efficiency with the provision of value while responding to the evolution of cloud technology and changing business needs.

By the way, I was surprised by the concept of "Copy To China" in the Chinese IT industry and its goal to compete with Silicon Valley. I would like to continue translating any good Chinese articles I come across.

SRG is looking for people to work with us. If you are interested, please contact us here.

https://ca-srg.dev/careers

SRG runs a podcast where we chat about the latest hot topics in IT and books. We hope you will enjoy listening to it while you work.

https://podcasters.spotify.com/pod/show/cyberagent-srg/episodes/Tech-Talk-with-SRG-1-e26fugj