Debate on "Is the Infrastructure Sector Really Necessary?"
I'm Ishikawa Kumo (@ishikawa_kumo) of the Service Reliability Group (SRG) in the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article is a translation of two articles that were popular within the Chinese infrastructure engineer community.
While I was reading a Chinese-language IT blog, I happened to come across an article arguing that "infrastructure departments are unnecessary," as well as a counter-argument. While some of the arguments were weakly supported, as someone who works in a similar department, I felt I could grasp the trends and challenges in China's IT industry, as well as some insights for infrastructure engineers, so I decided to translate them.
Issue raised: Is an infrastructure department really necessary?
Author: Machi
He has worked as a PM for DB engine development and Cloud Computing product development at Tencent and Huawei, and is currently a senior infrastructure engineer at PayPal Sweden.
Over the past 20 years, Chinese Internet companies have achieved great things. Infrastructure divisions (also known as technology platform divisions, operation and development divisions, and architecture platform divisions) have made significant contributions as the backbone of Internet companies' technological foundations. However, with technological advances, these divisions' old experiences are increasingly no longer applicable in the cloud era, and they are generally unable to address new challenges. These two factors combine to raise the question: Is the infrastructure division still worthwhile?
The value of the infrastructure division is to provide technology to the business division.
The Infrastructure division was established with the goal of consolidating the company's infrastructure and architecture talent and providing technical support to the business divisions.
To achieve this goal, the Infrastructure division must meet two conditions:
Possessing excellent technology
Leading business units and facilitating the use of technological excellence
Unfortunately, over the past few years, the infrastructure sector has been scoring increasingly poorly on these two criteria. Below, we list some common issues in the infrastructure sector and invite our readers to discuss them.
Problem 1: No core technology
In their annual presentations, the infrastructure department often emphasizes that they've built a platform that provides business departments with general-purpose features like big data management, operations, observability, and enhanced security.
Some infrastructure technology experts mock business developers as "CRUD programmers" and boast about their technical expertise.
However, upon closer inspection, it turns out that the infrastructure experts are actually nothing more than OSS assemblers. Their primary job is to find some OSS on GitHub, run some benchmark tests, and deploy the selected technology in their own data centers. During this process, sensible developers simply tweak the community version's parameters and deploy it as is. Smart developers modify it so that nobody else can take over. Smarter still, some developers write their own imitation of the selected OSS so that nobody else even wants to take over. I call these people "parameter tuning craftsmen," "modification craftsmen," and "imitation craftsmen," respectively.
Parameter Tuning Craftsman
Although parameter tuning craftsmen are often looked down upon, they are actually the most reliable. At the very least, if one of them leaves, the employer can find someone to take over. Let's take a look at Tencent Cloud's CLB documentation.
Layer 7 load balancing is primarily based on STGW. STGW is a Layer 7 load balancing service developed by Tencent based on Nginx that supports a large number of concurrent connections. It handles traffic for many of Tencent's internal Layer 7 businesses, including Tencent News, Tencent Games, and WeChat.
This CLB is based on Nginx, so what exactly is "in-house developed"? Perhaps the development of an installation script?
Magical Modification Craftsman
AliSQL
💡
AliSQL is a database engine based on MySQL 5.6 developed by Alibaba Cloud. The last release was in May 2018. The AliSQL GitHub repository is currently unmaintained, and PRs from the community have been completely ignored. As a result, there are many critical voices in the GitHub issues. https://github.com/alibaba/AliSQL/issues/112
Among Chinese IT engineers, Alibaba Cloud's OSS projects are sometimes mocked as "Open Source for KPIs."
Imitation craftsman
For example, Tencent Cloud's CDN server, NWS.
All Tencent Cloud CDN edge nodes use NWS. NWS is a high-performance HTTP server developed independently by Tencent, and is said to offer clear improvements in both functionality and performance compared to common servers such as Nginx.
In fact, NWS is nothing more than an imitation of Nginx, and the performance gains were spurious and only due to the fact that the Nginx being compared had suboptimal parameters.
💡
Translator's note: There is currently no information available to support the benchmark results for NWS and Nginx.
Problem 2: Not delivering business value
Over the past few years, parameter tuning, modification, and imitation have been most widespread in the big data field. Initially, everyone was using Hadoop, but the infrastructure department compared Impala, Presto, Spark Streaming, and Storm to select a technology, and then claimed to the CTO that they had a thorough understanding of big data technology. They then set about building a big data platform. After securing a budget and working on it for a year, all they had created was a Web UI, which they called an "integrated big data management platform." In cases with larger staffing levels, they even created a user management platform, but it often didn't even support SSO.
A few years later, as the Hadoop ecosystem fell out of favor in the West, these craftsmen turned to ClickHouse as a lifeline, and the same formula was repeated again.
Over the past decade, at least hundreds of infrastructure departments (or big data platform departments) across China have been doing this. When these craftsmen get together, they engage in childish competitions such as, "You need 600 machines, but I can do it with 300," or, "It takes me an hour to process 5TB of data, but it takes you 70 minutes, which is 14% faster." This is like two middle school students competing with each other, saying, "You can only pee two meters, but I can pee three meters."
In reality, business units don't care about the number of machines they have. They really care about whether the big data platform can create business value. For example,
Can the big data platform automatically ingest new data sources?
Can it be easily integrated with BI tools?
Is granular access control and access auditing possible?
Can you accurately bill business units?
Can you comply with personal information protection requirements?
However, due to the operational culture of the infrastructure department, they rarely pay attention to these issues. They focus on their comfort zone, tweaking the performance of their machines, and sometimes they report strange results like, "Three engineers spent six months saving the company 50,000 yen, the equivalent of half a machine."
Athena
Problem 3: Lagging technological ideas
Over the past decade, the software engineering industry has undergone a series of slow but steady technological innovations, making the current software industry very different from that of 15 years ago. However, these innovations have not had the dramatic impact of the iPhone or ChatGPT, and many workers have not felt the progress. This includes decision makers and engineers in the architecture and platform departments. Their philosophical lag has led to technological lags, causing them to lose their technological advantage over the business departments, and in some cases, even being overtaken.
Stick to solved scalability problems
W11
To be honest, 20 years ago, when LAMP configurations were common and many teams struggled with the C10K problem, BAT companies (Baidu, Alibaba, Tencent) explored large-scale service engineering practices, which were valuable in supporting their business growth.
But 20 years later, scalability is a solved problem, and the fundamental approach remains largely the same.
Use microservices, each with its own independent resources
Keep your web servers stateless and scale them out/in automatically
Use load balancing to handle external requests
Use containers to avoid manual deployment of binaries
Use object storage that is separate from the compute nodes, rather than using file systems that are difficult to scale
Ensure that each Pod is replaceable
Use an auto-scaling distributed database and add cache as needed
Use message queues to bridge the processing power gap between different systems
Infrastructure as Code
Build a CI/CD pipeline that includes code, deployment files, and secrets to ensure code quality.
Provide a standard observability dashboard for each microservice
Partition users in the business layer as needed
Essentially, these best practices are sufficient for most workloads. And there are tools for each step that any engineer can implement in their own team.
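As one small illustration of the list above, the item "partition users in the business layer" can be sketched in a few lines. This is only a minimal sketch under assumptions: the function name `shard_for_user`, the user ID format, and the shard count of 4 are all hypothetical and do not come from the original article.

```python
import hashlib


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index in the range [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a stable hash (not Python's per-process randomized hash())
    # so every stateless web server routes the same user to the same shard.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    # Hypothetical user IDs, for illustration only.
    for uid in ("user-1001", "user-1002", "user-1003"):
        print(uid, "-> shard", shard_for_user(uid, 4))
```

Plain modulo partitioning reassigns most users whenever the shard count changes, so a team expecting frequent resharding might prefer consistent hashing or a lookup table instead.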
Meanwhile, when infrastructure experts talk about high concurrency, they tend to focus on the minutiae of threads and processes, or use vague terms like "active-active configurations." If you poke around even a little, they'll lash out, asking, "Have you ever handled a Double Eleven?" "Do you know how many machines I manage?" "Do you have a billion users?" To be honest, all this talk amounts to a religious ritual.
Stick with ClickOps
LY.COM
This type of ClickOps was acceptable 15 years ago, but the industry has now moved to the concept of IaC, where engineers declare the resources they need in code and communicate directly with resource providers, eliminating intermediate operational steps.
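The core idea behind IaC can be sketched as a comparison between a declared desired state and the actual state, from which a change plan is derived. The following is a minimal sketch and not the API of any real IaC tool; `Plan`, `plan_changes`, and the resource names and attributes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    create: dict  # resources declared in code but not yet provisioned
    update: dict  # resources whose declared attributes differ from reality
    delete: list  # resources that exist but are no longer declared


def plan_changes(desired: dict, actual: dict) -> Plan:
    """Diff the declared state against the actual state."""
    create = {k: v for k, v in desired.items() if k not in actual}
    update = {k: v for k, v in desired.items()
              if k in actual and actual[k] != v}
    delete = sorted(k for k in actual if k not in desired)
    return Plan(create=create, update=update, delete=delete)


if __name__ == "__main__":
    # Hypothetical resources, for illustration only.
    desired = {
        "bucket/logs": {"versioning": True},
        "db/orders": {"replicas": 3},
    }
    actual = {
        "db/orders": {"replicas": 2},
        "vm/legacy": {"cpu": 4},
    }
    print(plan_changes(desired, actual))
```

Because the plan is computed from code under version control, it can be reviewed like any other change instead of passing through a ticket and a manual console operation.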
💡
Translator's notes:
LY.COM is a Chinese travel booking website.
A platform engineer at LY.COM responded to the above:
The above process has already been significantly improved, and now the HBase cluster is created automatically by the Kubernetes Operator after workflow approval.
LY.COM: In fact, the backend for LY.COM's HBase cluster service application integrates the configuration parameters of the application workflow with the HBase cluster configuration template to automatically launch a cluster instance using Kubernetes. This automated process works very smoothly, and there is no longer a situation where an operations engineer receives the ticket and runs a secret script to create a cluster.
Problem 4: Intentional obstruction to DevOps
In addition to technological delays, infrastructure departments sometimes intentionally hinder their teams' efficiency for their own benefit (budget and survival). For example, some departments exclusively manage all infrastructure resources under various pretexts, forcing developers to submit an application every time they need an object storage bucket. Even if a developer needs a database, they must submit an application, and even updating a single line of SQL must wait for DBA approval.
In CI/CD pipelines led by infrastructure departments, only the CI portion is functional, and the CD portion is intentionally separated. This is done under the pretext of "ensuring the stability of the production environment," but in reality, they fear that if they open up CD, business development teams will realize that they can do high-quality work efficiently without intermediate procedures, and their department will become unnecessary.
Suspicion of vendors
Even within the same company, the relationship between business and infrastructure departments is actually like that between users and vendors. Naturally, vendors tend to exclude other vendors. Even if there are external vendors offering excellent observability tools, the infrastructure department will not use them and will instead try to build its own, difficult-to-use ELK. Even if there is a reliable external RDS, the infrastructure department will try to build its own MySQL.
Before Lark and DingTalk became popular, at least 200 of China's top 500 companies had teams dedicated to developing internal IM tools. A great deal of manpower was wasted reinventing the wheel. These IM tools, supposedly tailored to the specific needs of each company, were difficult to use, had poor security, and often led to entire teams leaving the company.
💡
Translator's notes:
Lark and DingTalk are Chinese enterprise IM/Chat tools.
Furthermore, when a gaming company develops its own HR system, a food delivery company develops its own code management tool, or a telecommunications equipment company develops its own travel management system, they are essentially using their employer's funds to expand the influence of their department because they have free time.
Strained relationships with business divisions
Many infrastructure departments claim that their value lies in their ability to better understand the needs of the business, since they're internal staff. This sounds reasonable at first glance.
However, in reality, for many Internet companies, technology isn't a core competency; general software technology is sufficient, and there aren't many specialized needs. Conversely, if a business department has special needs that aren't generally supported in many industries, there may be a problem with the software they select.
A typical example is Zabbix using MySQL to store time-series data, which places many requirements on MySQL. However, these requirements should not be met, because MySQL was not originally designed for processing time-series data.
Another typical example is the fixed IP address requirement for Kubernetes, which is common in China. In reality, businesses that require a fixed IP address shouldn't use Kubernetes; if they insist on using containers, they should deploy them directly on virtual machines. However, when infrastructure departments catered to this need, the requirement spread throughout the country, even affecting the Cilium project. The Cilium project did not respond, and the issue remained unaddressed for three years. https://github.com/cilium/cilium/issues/17026
💡
Translator's note: After this Chinese article was published, we received a response from a Chinese committer right away.
The operations team's conservative thinking
Many companies' infrastructure departments are derived from their operations teams and place a high value on reliability. While this is a strength, it also creates a conservative culture. A common saying among veteran operations teams is, "Don't touch what's working."
If you take a look at the technology stacks of many leading IT companies, you'll see a large amount of outdated software, such as MySQL 5.7, CentOS, Python 2.X, and GCC 4.9, that is no longer maintained. This software poses a significant risk in terms of compatibility and security, as patches are not provided even when security holes are discovered. While infrastructure departments should be responsible for updating such outdated software, they often prioritize availability and are the most reluctant to upgrade technology.
Photos cited in the original article
Counterargument: The infrastructure sector isn't dead, it's just fading away
Author: Leo Li
Former Chief Architect at DiDi, CTO at MeiQia, and currently Co-Founder and CEO of ClapDB.com
Translator's note: ClapDB is a startup company that offers a serverless DB SaaS product for data analysis.
Yesterday, PayPal's Mr. Ma wrote an article titled "Is the Infrastructure Division Really Necessary?" in which he questioned the significance of the Infrastructure Division. As someone who has been involved in infrastructure for many years, I would like to respond to the issues raised in the article from a relatively objective standpoint. And with this article, I would like to pay my respects to the veterans of the infrastructure division.
What is the value of the infrastructure sector?
As Mr. Ma stated, the infrastructure division exists to provide technology to the business division, and to do so, the following two prerequisites are necessary:
Possessing excellent technology
Leading business units and facilitating the use of technological excellence
The "excellence" of technology varies depending on the situation
However, there is no single standard for "excellence." What makes excellent technology? Technology always involves trade-offs. For example, even if a certain technology can provide the highest performance, if its functions are too simple or its costs are too high, it will be meaningless for many companies.
The standards of excellence differ in each business context. Companies are not competing on the same field, so the infrastructure departments that support their businesses should not all share the same goals.
Take, for example, the goal of achieving maximum concurrency. Salesforce, a world-famous name in the SaaS industry, has been using Oracle as its underlying storage for many years, bucking the trend known as "de-IOE" (eliminating IBM, Oracle, and EMC storage). Salesforce's concurrency is also low, but that doesn't mean its technology isn't superior. It just means that its technology is appropriate for the Salesforce domain and doesn't meet the requirements of other domains. Cats don't need to race penguins; penguins are good swimmers.
"Leading" business units?
The infrastructure department exists as a cross-functional department, playing a role in reducing overlapping and incorrect development between departments. In fact, in the Internet and mobile Internet era, Internet companies focused on "growth" as their top priority, and in the process hired large numbers of inexperienced programmers and repeatedly tried and tested new business areas. If cross-functional departments did not provide "guidance" and "standardization," the business could grow too "wild."
No core technology in the infrastructure sector?
Are they really "parameter tuning craftsmen," "magical modification craftsmen," or "imitation craftsmen"?
In Mr. Ma's article, he discusses whether the infrastructure department has core technologies and categorizes engineers into "parameter tuning craftsmen," "modification craftsmen," and "imitation craftsmen."
I completely agree with this point. But what does that matter?
Many Chinese internet companies are modeled after American companies and conduct business in much the same way as their American counterparts. Therefore, there is no problem with using nearly the same technology as their American counterparts, as this is the most reliable and least risky technology choice. The original meaning of CTC (Copy to China) is to cross the river by feeling for America as the stepping stones. If you can copy the business model, the tech stack won't be an issue.
💡
Translator's Note:
Copy to China: This refers to Chinese companies imitating the business models of successful foreign companies. The degree of imitation varies, from simply offering directly competitive services to imitating the UI, trademark appearance, or similar-sounding names. While this practice is often criticized outside of China, including Japan, it is widely recognized in China as a model of success by both private companies and the government.
Furthermore, major American Silicon Valley companies, such as Twitter (now X), Facebook (now Meta), and Airbnb, have grown primarily on a foundation of open source technology. When problems arise in the process, they are resolved through "parameter tuning" or "modification." However, "imitation" is a symptom of the big company disease.
"Imitation" is a symptom of the big company disease
As a company grows, the "measure" for individual promotion and evaluation changes, shifting from contribution to "contribution x difficulty." However, as a company grows and business stabilizes, most contributions become less noticeable. As a result, employees tend to over-promote themselves in order to gain better evaluations and promotion opportunities.
The phenomenon of "imitation" arises as a company grows, when team goals diverge from the overall company's goals. Even if a company introduces KPIs and OKRs, this problem cannot be prevented. In reality, "imitation" is often justified for some noble reason, and problems that can be solved through imitation are often identified. The choice between imitation and transcendence is up to the individual, but without this gray area, innovation would be impossible. You can't understand the essence of anything until you've tried it. They say the devil is in the details, but until you actually try it, you can't tell which details are important and which are the devil.
In short, the big company disease is not necessarily a bad thing; rather, the resources and profits of large companies provide a certain amount of room for innovation. However, while investing involves risk, it is also true that if you don't invest, you won't get anything.
What's wrong with modifying OSS?
OSS generally comes from one of three kinds of developers:
1. Large companies that open source software for their own commercial purposes, to gain market share and beat competitors (Android, Chrome, Kubernetes, VSCode, etc.)
2. Companies that build a business on OSS and acquire paying customers through open sourcing (Elasticsearch, Nginx, MySQL, Cassandra, etc.)
3. Open source foundations (GNU, Linux, etc.)
Generally, type 1 follows their own company's roadmap and doesn't seem to care much about niche needs. Type 2 intentionally weakens enterprise features (otherwise, how would they sell a commercial version?). Type 3 takes a "PRs welcome" stance and doesn't offer the kind of service that commercial software does.
Internet companies around the world encourage their employees to modify OSS. That is why Facebook brought HBase initiatives in-house, and Google submitted a cgroup patch to upstream Linux. Modifying OSS is not a sin; rather, technical immaturity is the problem.
💡
Translator's note: The original text's expression contains a wide range of nuances, including not only technical immaturity but also organizational structure, culture, and perception.
How does the infrastructure function deliver business value?
Infrastructure functions typically operate as cross-functional teams and often lack business revenue to calculate their own ROI. However, infrastructure demonstrates value through more general metrics such as availability, scalability, and business iteration speed. Cats are good at catching mice, even if they don't have the latest nail polish. In Internet companies, technology serves the business, and the vast majority of companies are not FAANG. It's unfair and unrealistic to expect security guards at a typical company to invent Star Wars gear.
Delays in technological concepts
Copy from Google
Why are we obsessed with scalability?
The infrastructure department is responsible for availability and scalability, and anything else would appear to be neglecting their core business.
Is the infrastructure sector a barrier to technological progress?
Proving their value within an organization is an essential task for any team, especially cross-functional teams. Infrastructure teams only demonstrate their value when problems arise and they are held accountable. Therefore, it's natural for infrastructure teams to be conservative overall, which is consistent with the role and purpose of the department. Infrastructure teams often act as regulators of the business teams' demands and actions, which can make their efforts seem unrewarded. However, they still provide a vital constraint within the company.
Organizational inertia
This is the inertia found in every organization throughout history; most organizations never progress or innovate from within; they are driven by external forces. For example, the Soviet Red Army, faced with German tanks, still pinned its hopes on the Cossacks. Infrastructure teams are like any other team; their mission is to support the business. If they achieve that mission, there is little motivation to reinvent themselves. But this is not the fault of the Cossacks (an old ideology).
The barrier to entry for non-tech companies isn't technology
Many internet companies in the Chinese market are not technology-driven and do not need a continuously evolving infrastructure department. In fact, it is more practical to keep costs down when the business is not growing rapidly.
Why do infrastructure departments exclude outside vendors?
Infrastructure teams are often in-house technology providers, and it's natural for them to exclude external competitors. After all, external vendors take much the same approach as infrastructure teams: they build on open source systems with tweaks, parameter tuning, and even imitation. It's difficult for infrastructure teams to abandon their weapons and buy someone else's. Both users and vendors have their own problems.
Fated to vanish
Old soldiers never die, they just fade away.
New productive forces will inevitably create new production relations.
The infrastructure team itself is a microcosm of the internet companies of the past, which were "trying to achieve growth by using excess resources."
However, every era comes to an end, and like any excess resource during a period of growth, the infrastructure sector will gradually be reduced. Internet companies will shift from high investment to rational investment with a high ROI focus. They will shift from a desire for autonomy and control to a desire for control at a more reasonable cost.
💡
Translator's note: "Trying to achieve growth by using excess resources" refers to a growth strategy commonly seen in Chinese internet companies. In the mobile internet era, which began with the spread of the internet and the mobile market, many companies have overinvested resources and packed as many features as possible into a single app, aiming to lock in users and rapidly expand market share. This also aims to prevent users from abandoning the app and moving on to other services.
For example, even single-function apps like weather apps are often turned into all-purpose apps that add news, short videos, food delivery, and payment functions. This approach is known as "barbaric growth," which emphasizes scale over efficiency, and reflects the culture and structure of Chinese internet companies. (It's a phenomenon that has also been seen in Japan recently.)
Infrastructure teams will move in one of two directions:
Embrace the cloud and reduce labor costs
Embrace the market, bring your products to market and join the competition
By reassessing the role of infrastructure from an ROI perspective and repositioning it, unnecessary people will be removed from the team. Products of the times will naturally evolve with the times.
Conclusion
These two articles discuss the significance and future prospects of the "infrastructure department," which supports a company's technological foundation, from different perspectives. The reality is that the role that infrastructure departments play within a company is changing as technology evolves. As Ma points out, clinging to old ways risks being left behind by the times. On the other hand, as Li states, infrastructure departments still have technological value and an important role to play in providing stable support for the business.
In the future, infrastructure departments and infrastructure engineers will be faced with challenges such as reevaluating their roles and finding ways to balance efficiency with the provision of value while responding to the evolution of cloud technology and changing business needs.
By the way, I was surprised by the Chinese IT industry's concept of "Copy to China" and its goal to compete with Silicon Valley. I would like to continue translating any good Chinese articles I come across.
SRG is looking for people to work with us.
If you're interested, please contact us here.