Google's TPU Split: Training and Inference Chips Reshape AI Compute Wars

Google Cloud executives took the stage at Cloud Next in Las Vegas last week, unveiling the eighth-generation Tensor Processing Units with a twist. No more one-size-fits-all silicon. Instead, two distinct chips: TPU 8t for massive model training, TPU 8i for the low-latency demands of agentic AI inference. This bifurcation marks a sharp pivot, acknowledging that pre-training behemoths and real-time reasoning agents pull infrastructure in opposite directions. Google Blog called it the culmination of a decade’s work, engineered for supercomputing efficiency at scale.

TPU 8t packs a punch. A single superpod scales to 9,600 chips, delivering 121 exaflops of FP4 compute and 2 petabytes of shared high-bandwidth memory. That’s nearly three times the per-pod performance of seventh-generation Ironwood, which topped out at 42.5 exaflops across 9,216 chips. Bidirectional bandwidth doubles to 19.2 terabits per second per chip; scale-out networking quadruples to 400 gigabits per second. Google claims 80% better performance per dollar year-over-year, shrinking timelines for trillion-parameter models. And efficiency? Up to twice the performance per watt versus Ironwood. Data centers now churn out six times more compute per kilowatt-hour than five years back, according to Amin Vahdat, senior vice president for AI and infrastructure. TechRadar Pro.
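
The pod-level comparison above is simple arithmetic, and it is worth redoing per chip because the two pods differ slightly in size. A minimal Python sketch, using only the figures quoted above: the variable names are illustrative, decimal units are assumed (1 PB = 10^15 bytes), and both exaflops figures are taken at face value, since the numeric precision behind the Ironwood number is not stated here.

```python
# Back-of-the-envelope check of the TPU 8t vs. Ironwood figures quoted above.
# Assumptions: decimal units; both exaflops figures compared at face value.

EXA = 1e18
PETA = 1e15
GIGA = 1e9

TPU8T_CHIPS = 9600
TPU8T_EXAFLOPS = 121.0
TPU8T_HBM_PETABYTES = 2.0

IRONWOOD_CHIPS = 9216
IRONWOOD_EXAFLOPS = 42.5

# Pod-level speedup: the "nearly three times" claim.
pod_ratio = TPU8T_EXAFLOPS / IRONWOOD_EXAFLOPS

# Per-chip compute in petaflops, normalizing for the different pod sizes.
tpu8t_pf_per_chip = TPU8T_EXAFLOPS * EXA / TPU8T_CHIPS / PETA
ironwood_pf_per_chip = IRONWOOD_EXAFLOPS * EXA / IRONWOOD_CHIPS / PETA
chip_ratio = tpu8t_pf_per_chip / ironwood_pf_per_chip

# HBM per chip in gigabytes, if the 2 PB is spread evenly across the superpod.
hbm_gb_per_chip = TPU8T_HBM_PETABYTES * PETA / TPU8T_CHIPS / GIGA

print(f"pod-level ratio:   {pod_ratio:.2f}x")
print(f"per-chip compute:  {tpu8t_pf_per_chip:.2f} vs {ironwood_pf_per_chip:.2f} PFLOPS")
print(f"per-chip ratio:    {chip_ratio:.2f}x")
print(f"HBM per chip:      {hbm_gb_per_chip:.0f} GB")
```

On these numbers the pod-level gain works out to roughly 2.85x and the per-chip gain to roughly 2.73x, with about 208 GB of HBM per chip if the shared memory is divided evenly.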

Then there’s TPU 8i. Inference workloads crave speed and persistence—think multi-turn chats, planning loops, always-on agents. This chip triples on-chip SRAM to 384 megabytes per accelerator, with 288 gigabytes of HBM. Pods grow to 1,152 chips, yielding 11.6 exaflops—almost tenfold Ironwood’s 1.2 exaflops from 256-chip setups. It tackles the ‘latency wall,’ enabling longer contexts and multi-model serving without bottlenecks. General availability hits Google Cloud later this year, supporting Gemini and beyond. Google Cloud Blog.
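
The inference figures above can be sanity-checked with a few lines of Python. This is a rough sketch built only from the numbers quoted in the paragraph: the names are illustrative, decimal units are assumed, and the previous SRAM size is inferred from the word "triples" rather than stated anywhere in the article.

```python
# Back-of-the-envelope check of the TPU 8i vs. Ironwood inference figures.
# Assumptions: decimal units; prior SRAM size inferred from "triples".

EXA = 1e18
PETA = 1e15

TPU8I_CHIPS = 1152
TPU8I_EXAFLOPS = 11.6
TPU8I_HBM_GB = 288
TPU8I_SRAM_MB = 384

IRONWOOD_CHIPS = 256
IRONWOOD_EXAFLOPS = 1.2

# Pod-level speedup: the "almost tenfold" claim.
pod_ratio = TPU8I_EXAFLOPS / IRONWOOD_EXAFLOPS

# How much of that comes from simply having more chips per pod.
pod_size_ratio = TPU8I_CHIPS / IRONWOOD_CHIPS

# Per-chip compute in petaflops.
tpu8i_pf_per_chip = TPU8I_EXAFLOPS * EXA / TPU8I_CHIPS / PETA
ironwood_pf_per_chip = IRONWOOD_EXAFLOPS * EXA / IRONWOOD_CHIPS / PETA
chip_ratio = tpu8i_pf_per_chip / ironwood_pf_per_chip

# Aggregate HBM across a full pod, in terabytes.
pod_hbm_tb = TPU8I_CHIPS * TPU8I_HBM_GB / 1000

# SRAM per chip before tripling (inferred, not quoted).
implied_prior_sram_mb = TPU8I_SRAM_MB / 3

print(f"pod-level ratio:    {pod_ratio:.2f}x")
print(f"pod size ratio:     {pod_size_ratio:.1f}x")
print(f"per-chip ratio:     {chip_ratio:.2f}x")
print(f"HBM per pod:        {pod_hbm_tb:.1f} TB")
print(f"implied prior SRAM: {implied_prior_sram_mb:.0f} MB")
```

Read this way, most of the near-tenfold pod gain comes from the pod being 4.5 times larger; the per-chip gain is closer to 2.1x. A full pod would carry roughly 332 TB of HBM.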

Why split now? AI’s dual tracks diverged. Training guzzles memory for massive batches; inference demands low latency for responsive agents. One chip couldn’t optimize both. Google’s stack—from Axion ARM CPUs hosting the TPUs, to Virgo networking fabric scaling past a million chips—ties it together. Optical circuit switches link 8t superpods; 3D torus topology holds firm. No Nvidia tax here. Customers dodge GPU premiums while gaining purpose-built hardware. VentureBeat.

Ironwood set the stage last year. That seventh-gen chip, now generally available, raised HBM to 192 gigabytes per chip, six times Trillium’s capacity, and boosted SparseCore for embeddings. It powered reinforcement learning and high-volume serving. Trillium, the sixth generation before it, trained Gemini 2.0 and delivered 4.7 times the peak per-chip compute of v5e. Each leap compounds. The eighth generation pushes further, hosted on Axion for full-stack control. Analysts spot upside: TPU-to-CPU ratios tightening to 1:1 or 2:1, potentially driving millions of Axion units through 2028. The Register.

Competition heats up. Nvidia’s Rubin NVL72 joins Google Cloud soon—A5X instances blend it with TPUs. Huawei’s Ascend NPUs draw DeepSeek’s V4, hinting at bifurcated global stacks amid sanctions. Broadcom and MediaTek reportedly craft 8t and 8i, eyeing TSMC 2nm by late 2027. Google Cloud Blog. But Google’s vertical integration shines. Software like Pathways and JAX enables million-chip clusters. Persistent memory handles agent state across sessions.

Enterprises take note. Vertex AI rebrands as Gemini Enterprise Agent Platform, with $750 million for partners. Multi-gigawatt deals expand, Anthropic’s among them. Costs drop: 80% better inference performance per dollar means millions of agents run cheaply. Training timelines halve for frontier models. Power hogs? Not anymore. Vahdat: ‘We’ve innovated across hardware and software.’ Data Center Dynamics.

Scale beyond superpods looms. Virgo fabric distributes jobs over clusters, unlocking exascale AI. Agentic workloads, which reason and execute multi-step tasks, demand it. TPU 8i serves millions concurrently with sub-second responses. Training? 8t’s HBM ocean feeds the hungriest models.

Risks remain. Supply chains strain under HBM hunger: 8t’s 2 PB per superpod dwarfs prior generations. Energy grids creak, though Google’s per-watt gains help. Rollout timing matters; general availability in late 2026 leaves a window for rivals. Still, this dual-chip gambit positions Google squarely in the AI arms race. No generalist chips. Specialized silicon. The agent era arrives on twin engines.
