xAI’s Colossus: Conquering GPU Scale Challenges in the AI Compute Race

xAI's Colossus supercomputer in Memphis scales to 200,000-plus GPUs, powering Grok amid utilization struggles, power controversies, and partnerships like Cursor. Built at breakneck speed, it highlights AI's compute edge—and its pains.
xAI’s Colossus: Conquering GPU Scale Challenges in the AI Compute Race
Written by Lucas Greene

Elon Musk’s xAI has pushed GPU clusters to extremes. Colossus, its Memphis supercomputer, started with 100,000 Nvidia H100s. Built in 122 days. Doubled to 200,000 GPUs 92 days later. Now mixes 150,000 H100s, 50,000 H200s, and 30,000 GB200s. The world’s largest single coherent AI training setup.

Speed defined the project. Suppliers quoted 18 to 24 months for 100,000 GPUs. xAI demanded six. They repurposed an old Electrolux factory. Input power: 15 megawatts. Needed: 150. Rented generators. Leased a quarter of U.S. mobile cooling capacity. Power swings during training dropped 50% in 100 milliseconds. Generators couldn’t cope. Solution: Tesla Megapacks with custom software to smooth fluctuations. Networking? Cables for 100,000 GPUs. Teams worked four shifts, 24/7. Musk slept in the data center, cabling himself.

“The general principles of first-principle thinking apply to software, hardware, anything really,” Musk said. “Often, we were told something was impossible, but once we broke it down into its constituent elements, we could solve those.” Now Colossus trains Grok. Supports X. Aids SpaceX. Plans call for 1 million GPUs.

Scale Brings Brutal Efficiency Hurdles

Raw size isn’t enough. Utilization lags. xAI President Michael Nicolls admitted in a staff memo that Model FLOPs Utilization (MFU) hit just 11%. Industry averages: 35-45%. Target: 50%. GPUs sit idle without perfect orchestration. Power. Cooling. Networking. All must align. Traditional Ethernet delivers 60% throughput at scale. xAI’s Nvidia Spectrum-X Ethernet hits 95%. Zero latency degradation. No packet loss. BlueField-3 SuperNICs per GPU. RoCE v2 congestion control.

One ex-EPA official called unpermitted gas turbines a violation. “That is a violation of the law,” said Bruce Buckheit. Over a dozen run in Southaven, Mississippi. Fuel Colossus 1 and 2. Emit pollutants near schools. Residents report asthma flares. Noise. xAI seeks permits for 41 more. Could spew 6 million tons of greenhouse gases yearly. 1,300 tons of other pollutants. Company didn’t comment. Mississippi claims turbines are “portable.” EPA disagrees.

Power guzzles 250 megawatts now. 35 gas turbines provide 420 megawatts capacity. 208 Megapacks back it up. First substation: 97 days to build. Solaris supplies more. Wastewater recycling plant—world’s largest ceramic membrane bioreactor—processes graywater. Protects aquifers. Supplies surplus to locals.

Cooling mixes liquid and air. Supermicro 4U systems. Closed-loop water. 119 chillers handle 200 megawatts. Racks pack 64 GPUs each. 1,500 total. Memory bandwidth: 194 petabytes/second. Storage: over one exabyte.

But challenges persist. Fault tolerance across thousands of nodes. Checkpoint I/O. Cosmic ray bit flips. Mismatched firmware. Tangled cables. xAI engineers debug on the fly. Sustained runs last months.

Compute Flexes Across Musk’s Empire

Colossus powers more than Grok. SpaceX taps it too. Recent deal: Cursor, hot AI coding startup. xAI rents tens of thousands of chips. Cursor trains models there. SpaceX holds $60 billion buy option. Or pays $10 billion fee. Colossus equals one million H100s in power, SpaceX claims. Two Cursor execs jumped to xAI, reporting to Musk.

“Combining Cursor’s product and distribution to expert software engineers with SpaceX’s Colossus supercomputer,” the companies said. Cursor’s valuation soared—from $2.5 billion to $29.3 billion. Now eyes $60 billion.

Wall Street toured Colossus recently. SpaceX shows off amid IPO prep. Merged with xAI earlier. Losses hit $4.94 billion in 2025 on $18.67 billion revenue. AI infra bets.

Training ramps. Seven models parallel: Imagine V2, variants. 1T, 1.5T, 6T, 10T parameters. 10T rivals Anthropic’s biggest. $1.5 billion cost. Colossus 2: first gigawatt cluster. Vertical stacks cut latency. $659 million investment.

NAACP sued over Memphis site. Pollution fears. Temps rise. Few jobs added. But Memphis eyes AI hub status.

xAI rewrites rules. Others hoard GPUs—95% idle industry-wide. xAI scales aggressively. Fixes bottlenecks. Turns factory into frontier compute beast. Grok advances. Musk’s web tightens. Compute wars intensify.

Subscribe for Updates

EmergingTechUpdate Newsletter

The latest news and trends in emerging technologies.

By signing up for our newsletter you agree to receive content related to ientry.com / webpronews.com and our affiliate partners. For additional information refer to our terms of service.

Notice an error?

Help us improve our content by reporting any issues you find.

Get the WebProNews newsletter delivered to your inbox

Get the free daily newsletter read by decision makers

Subscribe
Advertise with Us

Ready to get started?

Get our media kit

Advertise with Us