Voice assistants charm with fluid intonation and natural cadence these days. But scratch the surface. Their reasoning lags years behind text counterparts. ChatGPT’s voice mode traces to early 2023 models. Grok sticks with 2023-era tech. Google Gemini keeps its voice updates under wraps. Users hear progress. They get stagnation underneath.
Logan Kugler laid it bare in Communications of the ACM. Voice systems prioritize low latency, and frontier models are too slow to deliver it.
“Models like Opus can take two to five seconds before they produce output,” said Abhishek Sharma, senior technical product marketing manager at Telnyx. “Once you layer in speech-to-text, text-to-speech, and network hops, you end up with three to seven seconds of dead air. No one tolerates that on a call. So teams reach for smaller, faster models, and the gap people notice between voice and text is largely a result of that compromise.”
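Sharma's arithmetic can be sketched as a simple latency budget. In the Python sketch below, only the two-to-five-second model figure and the three-to-seven-second total come from his quote. The way the remaining overhead is split across speech-to-text, text-to-speech, and network hops is an assumption chosen to reproduce that total, and the names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step in a voice turn, with a best-case and worst-case delay."""
    name: str
    low_ms: int
    high_ms: int


# Only the model row reflects the quoted 2-5 second figure.
# The other three rows are ASSUMED values, picked so the sum
# matches the quoted 3-7 second total.
STAGES = [
    Stage("speech-to-text", 300, 600),                # assumed
    Stage("model time to first output", 2000, 5000),  # from the quote
    Stage("text-to-speech", 300, 600),                # assumed
    Stage("network hops", 400, 800),                  # assumed
]


def total_latency_ms(stages):
    """Sum best-case and worst-case delays across all stages."""
    return (
        sum(stage.low_ms for stage in stages),
        sum(stage.high_ms for stage in stages),
    )


low_ms, high_ms = total_latency_ms(STAGES)
print(f"Dead air per turn: {low_ms / 1000:.0f}-{high_ms / 1000:.0f} seconds")
```

Under these assumed numbers the model accounts for most of the wait, which is why swapping in a smaller, faster model is the first lever teams pull.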
Sharma nailed the tension. Low latency clashes with deep reasoning. A chat window can show a model thinking. A speaker can only go silent.
Arvind Sundararaman, formerly head of field CTO at Snowflake and now at Databricks, put it more sharply. “Frontier reasoning models are large, compute-intensive, and optimized for depth rather than real-time responsiveness.” Voice demands a response budget of a few hundred milliseconds per turn. Text users wait patiently. In voice, the same wait breaks the conversation.
Data shortages compound the problem. Text data floods open sources. Clean voice data? Scarce. Voice guzzles tokens too: tens per second of speech, against 3-4 for text. Shyam Gollakota, a University of Washington professor, explained the hit. “Getting a pre-trained text model to learn audio often results in the final model losing some of the reasoning abilities and controllability of the original text model.” Teaching a model audio natively, in other words, can cost it some of its smarts.
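The token gap is easy to put in numbers. In the Python sketch below, the text rate of 3-4 tokens per second is the figure given above. The audio rate of 25 tokens per second is an assumption standing in for "tens per second", and the ten-second utterance is an arbitrary example.

```python
# ASSUMED: the article says only "tens" of audio tokens per second.
AUDIO_TOKENS_PER_SECOND = 25
# From the article: text runs at roughly 3-4 tokens per second of speech.
TEXT_TOKENS_PER_SECOND = (3, 4)


def tokens_for(seconds, tokens_per_second):
    """Tokens needed to represent `seconds` of speech at a given rate."""
    return seconds * tokens_per_second


utterance_seconds = 10  # arbitrary example length

audio_tokens = tokens_for(utterance_seconds, AUDIO_TOKENS_PER_SECOND)
text_tokens = tuple(
    tokens_for(utterance_seconds, rate) for rate in TEXT_TOKENS_PER_SECOND
)

print(f"Audio: {audio_tokens} tokens")
print(f"Text:  {text_tokens[0]}-{text_tokens[1]} tokens")
```

At the assumed audio rate, the same ten seconds of speech costs 250 tokens as audio against 30-40 as text, so a native audio model spends several times more of its capacity on the same sentence.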
Companies respond by cascading systems. Speech-to-text transcribes the input. A text model reasons. Text-to-speech converts the answer back to audio. That keeps the text model's intelligence. But latency swells with every stage. Full-duplex talk adds chaos: the system must spot when to chime in amid pauses, noise, and overlaps. “Unlike text where you type in the whole question to get a response, with voice the model needs to know when to respond as well,” Gollakota said.
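The "when to respond" problem Gollakota describes can be illustrated with the simplest possible baseline: wait for a fixed run of quiet audio frames. This Python sketch is a naive illustration, not how any product named here works. The function name, the energy threshold, and the frame counts are all assumptions.

```python
def end_of_turn_index(frame_energies, silence_threshold=0.01, min_silent_frames=25):
    """Return the index of the frame where the speaker is judged to be done.

    A turn ends once `min_silent_frames` consecutive frames fall below
    `silence_threshold` after some speech has been heard. Returns None
    if that never happens. All parameter defaults are assumed values.
    """
    silent_run = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: note it and reset the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Quiet frame after speech: count toward the end of the turn.
            silent_run += 1
            if silent_run >= min_silent_frames:
                return index
    return None


# Made-up energies: leading silence, speech, a short mid-sentence pause,
# more speech, then a long trailing silence.
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.5] * 10 + [0.0] * 30

impatient = end_of_turn_index(frames, min_silent_frames=3)
patient = end_of_turn_index(frames, min_silent_frames=25)
print(impatient, patient)
```

On this made-up input, the impatient setting fires at frame 17, inside the speaker's mid-sentence pause, while the patient setting waits until frame 52 and adds that delay to every turn. A fixed silence timer forces a choice between interrupting and lagging.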
Biological cues matter too. Jennalyn Ponraj, a synthetic speech researcher at Delaire.ai, warned of the stakes. “In high-stakes AI voice deployments, including emergency services, elements such as breath timing, micro-pauses, vocal effort, and cadence play a biological role in signaling safety and presence.” Miss them, and trust erodes. Prosody shifts by culture too. Warmth in one accent reads as stress in another.
The gap feels structural. Voice craves speed and continuity. Reasoning thrives on compute and depth. Faster inference helps: Groq, Fireworks AI, and Cerebras all speed it up. But voice can’t afford text’s pauses. Big firms hoard audio datasets. Open efforts lag.
Recent moves hint at change. OpenAI retired GPT-4o from ChatGPT on February 13, 2026, but clarified that voice runs a distinct model, per its help page. Users mourned the emotional depth of older voices, as The Guardian reported. Newer ones felt formulaic. Some pin their hopes on strong models that match the speed of Claude Haiku. Sharma allowed: “If someone gets a strong model running at Claude haiku-level speeds, voice changes quickly for a lot of use cases.”
Sundararaman captured the bind. “Real-time voice interaction rewards speed and continuity; advanced reasoning rewards compute and depth. Bridging those two without degrading one side is fundamentally harder than improving either independently.”
X chatter echoes the frustration. One user noted that OpenAI’s two-year-old voice model flunks viral tests, tricks that text handles fine. Another called out Spotify ads with dated AI explanations. Husk’s clips expose voice fumbling basics like “Where’s the meatloaf?” Text sails through.
Progress brews elsewhere. Open-source VoxCPM 2 clones voices in 30 languages, runs in real time on consumer GPUs, and scores higher than ElevenLabs on benchmarks. But chatbots cling to cascaded setups. Natively multimodal models promise more. OpenAI is eyeing a new voice model for early 2026, according to discussion on the Ars Technica forums.
Voice is moving deeper into cars, phones, and wearables. Sound alone sells intelligence. Yet older models underneath limit it. Users sense the mismatch. Complex tasks stay typed. Until inference matches voice’s tempo without slashing capability, keyboards rule deep talks. Voice? It dazzles. It disappoints.


WebProNews is an iEntry Publication