xAI's Grok 4 Heavy posted top scores on Humanity's Last Exam and ARC-AGI-2 by scaling up parallel test-time compute and reinforcement learning. The approach shifts the focus of AI progress from pre-training to inference-time reasoning, but it raises fresh questions about cost, control and oversight.