The $2 trillion AI infrastructure problem no one is talking about, and the engineer solving it


The AI infrastructure earnings calls of the past eight quarters have given the public a precise vocabulary for what the build-out costs in capital. Hyperscaler GPU procurement. Power purchase agreements. Real-estate footprints. The vocabulary they have not given the public is for what it costs to keep the clusters healthy on a recurring basis after the capital is spent. That line item, on close inspection, has become one of the largest hidden cost centers in the entire build-out. It is growing faster than the capital line above it.

The visible numbers in the AI infrastructure conversation describe the capital story. Hyperscaler GPU procurement is on track to cross multi-trillion-dollar cumulative spend over the current cycle. Power purchase agreements have moved into the range that historically described heavy industry. Real-estate commitments have followed. The capital narrative has been told in detail across two years of investor updates.

The operational story is less visible. It describes what it costs to keep the clusters healthy. The work is unglamorous and largely manual. GPU node failures have to be detected, triaged, and remediated. Pods have to be rescheduled around degraded hardware. Resource utilization across an accelerator fleet has to be monitored, balanced, and reported on. Each of these tasks is, in current production environments, performed by a class of engineer whose compensation is among the highest in the industry.

The scale of the bill is enormous. Industry analysts who track GPU utilization across hyperscaler fleets have, for several years, reported routine idle rates above thirty percent on production accelerators. The headcount required to keep cluster operations running has scaled roughly linearly with cluster size rather than sub-linearly, even in environments where the explicit goal of every infrastructure team is to break that proportionality. In aggregate, the operational layer is one of the line items that turns the AI infrastructure thesis from a strong investment story into a structural margin problem.

The work to address it has, until recently, sat inside the bespoke automation tooling of the largest operators, accessible only to the engineers who built it. That is starting to change. Shashidhar Bhat, a software engineer in the big-data infrastructure organization at ByteDance, has spent the past two years producing a body of work that maps directly onto the operational layer the rest of the industry has been describing as a problem.

The pieces, individually, look like ordinary infrastructure components. Custom device plugins for finer-grained accelerator scheduling. Observability tooling built on top of NVIDIA’s Data Center GPU Manager. Autonomous pod rescheduling logic that reacts to hardware degradation without human escalation. Each is the kind of thing that gets shipped quietly inside an internal infrastructure team. Taken together, they describe the operational layer that the industry has been outsourcing to site reliability engineers, ported into software and hardened against production load.
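To make the idea of autonomous rescheduling concrete, here is a minimal Python sketch of the general pattern: read per-node GPU health signals, flag degraded nodes, and move their pods to healthy nodes with spare capacity. It is an illustration only, not Bhat's implementation or any part of his internal tooling. The names (`Node`, `Pod`, `is_degraded`, `reschedule`) and the ECC-error and temperature thresholds are hypothetical, and a production system would take its signals from a telemetry source such as DCGM and act through the Kubernetes API rather than on in-memory objects.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen for illustration only.
MAX_ECC_ERRORS = 10
MAX_TEMP_C = 90.0


@dataclass
class Pod:
    name: str
    gpus: int  # number of GPUs the pod requests


@dataclass
class Node:
    name: str
    total_gpus: int
    ecc_errors: int = 0
    temp_c: float = 60.0
    pods: list = field(default_factory=list)

    def free_gpus(self) -> int:
        """GPUs on this node not yet claimed by a pod."""
        return self.total_gpus - sum(p.gpus for p in self.pods)


def is_degraded(node: Node) -> bool:
    """A node counts as degraded if either health signal crosses its threshold."""
    return node.ecc_errors > MAX_ECC_ERRORS or node.temp_c > MAX_TEMP_C


def reschedule(nodes):
    """Move pods off degraded nodes. Returns (moves, unplaced pod names)."""
    healthy = [n for n in nodes if not is_degraded(n)]
    moves, unplaced = [], []
    for node in nodes:
        if not is_degraded(node):
            continue
        for pod in list(node.pods):  # copy: the list is mutated below
            candidates = [h for h in healthy if h.free_gpus() >= pod.gpus]
            if not candidates:
                unplaced.append(pod.name)  # leave in place, report it
                continue
            # Simple placement rule: the healthy node with the most free GPUs.
            target = max(candidates, key=Node.free_gpus)
            node.pods.remove(pod)
            target.pods.append(pod)
            moves.append((pod.name, node.name, target.name))
    return moves, unplaced


if __name__ == "__main__":
    fleet = [
        Node("gpu-a", 8, ecc_errors=25, pods=[Pod("train-1", 4), Pod("train-2", 4)]),
        Node("gpu-b", 8, pods=[Pod("serve-1", 2)]),
        Node("gpu-c", 8),
    ]
    print(reschedule(fleet))
```

The placement rule here is deliberately naive. The point is the shape of the loop: detection, triage, and remediation run without a human in the path, and pods that cannot be placed are reported rather than silently dropped.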

The scale at which Bhat’s work runs is part of what makes it credible as a reference architecture. ByteDance, parent of TikTok, operates one of the largest Kubernetes deployments in the world. Its clusters run on hundreds of GPU nodes processing roughly one petabyte of data each month. Bhat’s internal framework, an agent-based automation system called OpenSkill, has reduced GPU idle time by thirty-five percent across that environment, against a baseline that included the usage spikes characteristic of large-scale recommender training and content distribution.

A thirty-five percent figure is, by the operational standards of the field, large. Hyperscaler-class operators have for years been chasing single-digit-percentage improvements in idle rates, on the reasoning that single-digit improvements at hyperscaler volumes pay back in eight figures. A reduction at the scale Bhat reports is the kind of result that, when it appears in production at a peer company, is closely held. The fact that it has been reported at all is part of why the wider operator community has begun paying attention.
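The arithmetic behind that reasoning can be sketched in a few lines of Python. Every input below is a hypothetical assumption for illustration, not a figure from ByteDance or any operator: a fleet of 100,000 GPUs, a cost of $2 per GPU-hour, and a thirty percent idle rate. The sketch also assumes the improvement is a relative reduction in idle GPU-hours.

```python
HOURS_PER_YEAR = 24 * 365


def annual_idle_cost(gpus, cost_per_gpu_hour, idle_rate):
    """Yearly cost of GPU-hours that are paid for but sit idle."""
    return gpus * cost_per_gpu_hour * HOURS_PER_YEAR * idle_rate


def annual_savings(gpus, cost_per_gpu_hour, idle_rate, relative_reduction):
    """Savings from cutting idle GPU-hours by a relative fraction."""
    return annual_idle_cost(gpus, cost_per_gpu_hour, idle_rate) * relative_reduction


if __name__ == "__main__":
    # Hypothetical fleet: 100,000 GPUs at $2 per GPU-hour, 30% idle.
    fleet = dict(gpus=100_000, cost_per_gpu_hour=2.0, idle_rate=0.30)
    for reduction in (0.05, 0.35):
        saved = annual_savings(relative_reduction=reduction, **fleet)
        print(f"{reduction:.0%} less idle time: about ${saved / 1e6:.0f}M per year")
```

Under those assumed inputs, idle capacity costs roughly $526 million a year, so a five percent relative reduction is worth about $26 million and a thirty-five percent reduction about $184 million. Different fleet sizes and prices change the totals but not the shape of the calculation.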

The other half of Bhat’s recent work has appeared on the open-source side. He has been a contributor to Kubewharf Katalyst, the resource management framework maintained jointly by ByteDance and the broader Kubernetes community. The Katalyst project is one of the few in the cloud-native ecosystem to address the joint scheduling of CPU and GPU resources under load. The design proposals Bhat has filed against the project have moved the discussion in directions that closely parallel his internal work. The convergence between an engineer’s internal production work and external open-source contributions is the rare kind of pattern the maintainer community recognizes as substantive rather than promotional.

The third leg of the body of work is Carbon-Kube, the open-source Kubernetes scheduler Bhat released this past December alongside an IEEE paper co-authored with Sathwik Rao Sirikonda, also at ByteDance. The scheduler is a distinct project from his internal ByteDance work and addresses the carbon-emissions dimension of cluster operations rather than the headcount dimension. The project ships with a citation file, a published benchmark methodology, and reproducible scripts. The contribution is methodologically rigorous in a way that most internal infrastructure tooling never bothers to be.

The combined picture is what makes the case worth making at the industry level. The AI infrastructure operational layer is a cost center the size of a medium economy. The work to address it has been happening quietly inside the largest companies, accessible only to their internal teams. That is changing, in part because of the work of operators like Bhat, whose contributions span internal production deployments, external open-source maintenance, and research-grade publications under his own name.

The argument that the operational layer is the next major margin frontier in AI infrastructure is, on the strength of the work that has shipped in the past year, hard to dismiss. Cluster operators in the next two to three years will need to decide whether to build their own answer or to adopt one of the open-source ones now becoming available. The composition of that answer will reshape the operational margin of every team running production AI workloads.



Robot mowers on a yard

Maria Diaz/ZDNET


The perfect robot mower for you is not nearly as fancy and feature-heavy as you may think. I’ve said it before, and I’ll say it again: it’s not the lawn mower, it’s all about the yard. A robot mower may be a market leader with top-of-the-line specs and still not be a good fit for your yard.

Here’s the great news: There’s a perfect robot mower for almost any yard. As someone who’s tested numerous types of robot lawn mowers, I’ve learned that many of the specs that brands market as groundbreaking are simply not vital for most shoppers. A mostly flat, fenced-in 0.10-acre yard doesn’t need the power that a hilly, sectioned, unfenced one-acre yard does.

If you’re looking to choose the best mower for your home, be sure to check out ZDNET’s robot mower buying guide.

Here’s what you don’t need to stress over when buying a robot mower

| For yards with… | Best robot mower type | Examples |
| --- | --- | --- |
| No fences | A wired boundary is best, but a great GPS/RTK robot mower can stick to the map you make with it. | Yardcare E400, Mammotion Luba 3 |
| Fences | A LiDAR robot mower that can be dropped to mow with little setup and learn its map as it navigates. | Eufy E15, Ecovacs Goat A3000 |
| A lot of trees | A LiDAR or wired boundary mower, since trees can interfere with satellite signals. | Husqvarna iQ series (optional wire, EPOS) |
| Unbordered garden beds | A GPS/RTK robot mower that you can set up to avoid flower beds when mapping. | Mammotion Luba 3, Husqvarna iQ Series |
| Bordered garden beds | A LiDAR, GPS, or wired boundary robot mower works for these yards. If you choose a wired boundary, you may have to bury wire around the flower beds, unless the borders are tall enough for the mower to avoid. | Mammotion Yuka, Navimow Series H |
| Pets | A LiDAR robot mower that can adjust its navigation in real-time in reaction to its surroundings. | Mova LiDAX Ultra 2000, Segway Navimow i2 |
| Hills and uneven terrain | An AWD robot mower capable of handling steep slopes, regardless of the navigation type. | Mammotion Luba 3, Husqvarna iQ |

1. Don’t focus on: ‘AI-powered’ or other marketing buzzwords

Artificial intelligence (AI) has surpassed the popularity of acid-wash jeans in the 80s and Baby G watches in the early 2000s. And tech companies — including robot lawn mower manufacturers — are capitalizing on its appeal.

Most of these “AI-powered” or “intelligent mowing” terms are vague, geared to grab shoppers’ attention with buzzwords. That doesn’t mean that the robots don’t use AI to navigate, however. 

The key is to find out how the robot uses AI to its benefit, and whether that will meet your AI expectations. 

AI algorithms typically process data captured by the robot’s hardware to help it make quick decisions and adjustments. For example, a robot lawn mower may have a set of sensors and cameras to capture its surroundings. The robot’s processor then uses AI to convert that information into actionable data, so it knows whether to swerve to avoid an obstacle or slow down around a retaining wall.

Instead, look for: The navigation tech under (and on) the hood

Instead of AI and other buzzwords, you should focus on matching the robot lawn mower’s hardware and navigation system to your yard. This includes whether the robot uses RTK (Real-Time Kinematic) for positioning, and whether it features LiDAR, cameras, and sensors. 

Then look at real user reviews to assess how accurately the robot mower maps and how well it performs around various types of obstacles.

There’s no blanket rule for robot mowers, but most do well with the following guidelines.

2. Don’t focus on: Premium extras

Skip the premium extras that don’t match your yard. You really don’t need the most advanced robot mower; you need the one that will best handle your lawn. 

Most US homeowners have mostly flat lawns, simple rectangular layouts, minimal obstacles, and small yards. Yet some of the most popular mowers advertise features that don’t match this, and you don’t want to spend an extra few hundred dollars on advanced features that won’t deliver a noticeable difference in your yard.

Instead, look for: Only as much as you need

Do you have a mostly flat lawn with no fences and need a robot that can navigate to several sections separated by paths? Then you can skip AWD models and commit to superior mapping and navigation features, like multi-zone intelligence.

Similarly, if you have a yard with dense trees covering most of it, it’s safe to skip the RTK models and go for LiDAR or boundary wire options instead. 

3. Don’t focus on: Flashy app features

The path lines created by the Mammotion Luba 2, as captured by our Blink Outdoor camera, are one flashy app feature I can’t quit.

Any dependable robot lawn mower requires an equally reliable mobile app to let you use it effectively. However, manufacturers market many flashy app features that end up being unnecessary for many users. 

Don’t make app features the deciding factor unless it’s something you genuinely care about. Many users don’t rely on voice control to run their mowers and don’t mind using a separate app for their robot rather than integrating it into an existing home automation system.

A robot lawn mower with mediocre navigation and cutting performance can still have a flashy app — all while leaving behind missed patches or taking longer to finish mowing.

Instead, look for: The features you’ll actually use

Most robot mower users keep them running on a schedule to get the lawn-cutting chore off their minds. The majority of the most popular models offer basic features beyond scheduling, such as remote start and stop, basic mapping, automatic rain delay, and theft protection. 

It’s easy to find robot lawn mowers with these features, but if you’re looking for anything beyond that, just be sure that the feature is worth it, especially if you’re paying extra for that model.

One flashy app feature that is completely unnecessary but that I love having? Mammotion’s pattern cutting. In the Mammotion app, I can select the cutting pattern I want, whether lines or checkered, and I can also have the robot cut custom patterns, like letters and numbers. I don’t care for mowed letters in my yard, but I like that the lawn always has that freshly mowed checkered pattern with no effort from me.

4. Don’t focus on: Cutting system extras

The cutting width and system specs are important, as they can determine whether a robot can cover a given area in a day. However, most robot mowers use similar multiple-blade mulching systems. 

Unlike traditional lawn mowers with large blades for aggressive cutting in a single pass, robot mowers typically feature a set of small blades that constantly spin. Because of this, robot mowers trim smaller amounts of grass with each pass than a traditional mower, but they also cut more frequently and leave behind smaller grass clippings that decompose naturally.

Because robot mowers cut small amounts of grass on frequent passes, the real-world differences between one brand’s cutting system and another’s are often smaller than you’d expect. Other issues, like poor navigation, will be glaringly obvious long before small differences in blade design are.

Instead, look for: Cutting width and yard size

The average US yard benefits more from navigation quality, consistency, and connectivity than from blade design. Rather than comparing blades, focus on matching the mower to your yard size.

The robot’s capacity is measured in how many acres it can cover in a day. Among other factors, this is calculated from the robot’s battery size and cutting width. Most users want a robot that can mow the entire yard in a day, so they can set it and forget it and always come home to a mowed yard. You get that by choosing a robot rated for your yard size.
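For a rough sense of how battery and cutting width turn into a daily capacity figure, here is a simplified Python sketch. It is not any manufacturer's rating method. The function name `daily_coverage_acres`, the efficiency factor (a stand-in for overlap, turns, and obstacle avoidance), and all the example numbers are assumptions made for illustration.

```python
SQ_M_PER_ACRE = 4046.8564224


def daily_coverage_acres(cut_width_m, speed_m_s, run_min, charge_min,
                         efficiency=0.7, hours_available=24.0):
    """Estimate the acres a robot mower can cover in one day.

    run_min and charge_min describe one battery cycle: how long the
    mower cuts on a charge and how long it then takes to recharge.
    """
    # Share of the available time actually spent mowing.
    mowing_fraction = run_min / (run_min + charge_min)
    mowing_seconds = hours_available * 3600 * mowing_fraction
    # Swept area, scaled down for overlap, turns, and detours.
    area_m2 = cut_width_m * speed_m_s * mowing_seconds * efficiency
    return area_m2 / SQ_M_PER_ACRE


if __name__ == "__main__":
    # Hypothetical mower: 20 cm cut, 0.5 m/s, 120 min run, 90 min charge,
    # allowed to work 10 hours a day.
    acres = daily_coverage_acres(0.20, 0.5, run_min=120, charge_min=90,
                                 hours_available=10)
    print(f"Estimated coverage: {acres:.2f} acres per day")
```

With these assumed inputs the estimate comes out at a little over a third of an acre per day. A wider cutting deck or a longer runtime per charge raises the figure in direct proportion, which is why those two specs matter more for large yards than for small ones.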




