Engineers at Cloudflare spent weeks chasing an elusive failure in their QUIC stack. Tests that once sailed through suddenly timed out. Downloads that should finish in seconds dragged on. The culprit? A congestion window stuck at its absolute floor. Two packets. No growth. A self-reinforcing loop that refused to break.
This wasn’t some obscure edge case. It struck after real packet loss on connections running CUBIC, the loss-based algorithm used widely for TCP in the Linux kernel and in quiche, Cloudflare’s open-source QUIC implementation. The bug pinned performance so low that a 10-megabyte transfer under moderate early loss failed to complete within 10 seconds 61 percent of the time. Fix it, and every test passes. The download finishes in four or five seconds.
But the story runs deeper than one missed deadline. It reveals how a well-intentioned optimization introduced years ago in the Linux kernel collided with the realities of user-space transport protocols. And it exposes the narrow path QUIC implementers must walk when they borrow ideas from decades of TCP battle-testing.
CUBIC, now standardized in RFC 9438, models congestion window growth with a cubic function. Its epoch starts after loss events. The time delta since that epoch determines how aggressively the window should open. Idle periods complicate the math. Send nothing for a while and the elapsed time keeps counting, so the curve would explode upon resumption. The kernel developers solved this by shifting the epoch forward by the idle duration instead of resetting it. The approach preserved the mathematical shape of the growth curve. It made sense.
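To make the idle problem concrete, here is a minimal sketch of the window curve using the formula and the constant C = 0.4 from RFC 9438. The scenario values (a 100-packet window before loss, a 70-packet window at epoch start, two seconds of sending and a 30-second pause) are illustrative assumptions, not figures from Cloudflare’s post.

```python
# CUBIC window curve from RFC 9438: W_cubic(t) = C * (t - K)^3 + W_max.
# The scenario values (w_max, cwnd_epoch, idle) are illustrative assumptions.

C = 0.4  # RFC 9438 scaling constant


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds needed for the curve to climb from cwnd_epoch back to w_max."""
    return ((w_max - cwnd_epoch) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window in packets, t seconds after the epoch started."""
    return C * (t - cubic_k(w_max, cwnd_epoch)) ** 3 + w_max


w_max, cwnd_epoch = 100.0, 70.0   # window before loss, window at epoch start
epoch_start = 0.0
idle = 30.0                       # seconds with nothing to send
now = 2.0 + idle                  # 2 s of sending, then the pause

# Epoch left alone: the idle time counts as growth time.
naive = w_cubic(now - epoch_start, w_max, cwnd_epoch)

# Epoch shifted forward by the idle duration: the curve resumes where it paused.
shifted = w_cubic(now - (epoch_start + idle), w_max, cwnd_epoch)

print(f"unshifted target: {naive:.0f} packets")
print(f"shifted target:   {shifted:.1f} packets")
```

Under these assumptions the unshifted target runs to several thousand packets, while the shifted one stays just below the pre-loss window. That gap is the burst the epoch shift was meant to prevent.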
Cloudflare engineers ported a related Linux commit from 2017 into quiche. The original kernel change by Eric Dumazet, Yuchung Cheng and Neal Cardwell addressed a genuine TCP problem described in the RFC. Yet in quiche the same logic produced the opposite outcome. The congestion window refused to leave its minimum. Recovery never happened.
Here’s how the trap closed. After a severe loss event drops the window to two packets, the connection sends its entire allowance. One round-trip time later the acknowledgments arrive. Bytes in flight fall to zero. The next burst triggers the on_packet_sent callback. It sees bytes_in_flight equal to zero and concludes the connection sat idle. So it adds the full elapsed time since the last sent packet to the congestion recovery start time. That pushes the recovery timestamp into the future.
A packet counts as sent during recovery when its send time is no later than the recovery start time, so every subsequent acknowledgment evaluates as still inside the recovery period. CUBIC refuses to grow the window. The cycle repeats every 14 milliseconds or so, matching the round-trip time. One test run logged 999 state transitions in 6.7 seconds. The spiral feeds itself, and it continues until tiny variations in scheduler timing or acknowledgment processing finally let the inequality slip. Sometimes that breaks the loop. Often it does not.
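The loop can be replayed with a few lines of timestamp arithmetic. This is a toy model, not quiche’s code: the function name, the 1 ms gap between the last send and loss detection, and the 1,350-byte packet size are assumptions made for illustration. The recovery test, send time no later than recovery start, follows the pseudocode in RFC 9002.

```python
# Toy replay of the timestamp arithmetic, in integer milliseconds.
# RTT comes from the article; packet size and the 1 ms loss-detection
# offset are assumptions for illustration, and this is not quiche's code.

RTT_MS = 14
MIN_CWND_PACKETS = 2
PACKET_BYTES = 1350  # assumed packet size


def replay_buggy_rounds(rounds: int, last_sent: int, recovery_start: int):
    """Return (send_time, recovery_start, in_recovery) for each round."""
    history = []
    now = last_sent
    for _ in range(rounds):
        now += RTT_MS                    # ACKs arrive, bytes_in_flight drops to 0
        delta = now - last_sent          # bug: the ACK wait is counted as idle time
        recovery_start += delta          # recovery start slides forward
        last_sent = now                  # next two-packet burst goes out
        in_recovery = last_sent <= recovery_start  # RFC 9002 recovery test
        history.append((last_sent, recovery_start, in_recovery))
    return history


# Loss is detected 1 ms after the last packet was sent.
history = replay_buggy_rounds(rounds=5, last_sent=0, recovery_start=1)
for send_time, recovery_start, in_recovery in history:
    print(send_time, recovery_start, in_recovery)

# Back-of-envelope ceiling while the window stays pinned at the floor.
rounds_in_10s = 10_000 // RTT_MS
bytes_in_10s = rounds_in_10s * MIN_CWND_PACKETS * PACKET_BYTES
print(bytes_in_10s)
```

In this model the recovery start stays exactly 1 ms ahead of every burst, so each round is judged to be in recovery. With the window held at two packets per 14 ms round trip, ten seconds moves roughly 1.9 megabytes under the assumed packet size, far short of a 10-megabyte download.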
The difference between kernel and user-space implementations proved decisive. Linux can hook into events like CA_EVENT_TX_START when the TCP stack begins transmitting after a pause. A user-space QUIC stack such as quiche has no such callbacks. Its congestion controller must infer state from packet sends, acknowledgments and the bytes_in_flight counter alone. The idle measurement therefore used the wrong reference point: time of last packet sent rather than the moment bytes_in_flight actually reached zero after processing the final acknowledgment.
So the fix required three lines of logic and one new timestamp. Engineers added last_ack_time to the CUBIC state structure. They update it whenever acknowledgments arrive. Then in on_packet_sent they calculate the idle start as the later of last_ack_time or last_sent_time. The delta uses that value. The adjustment to congestion_recovery_start_time now reflects true idle time instead of an entire round trip spent waiting for acknowledgments. The death spiral vanishes.
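A minimal sketch of that logic follows, written in Python for readability although quiche itself is Rust. The field and callback names last_ack_time, last_sent_time, congestion_recovery_start_time, on_packet_sent and bytes_in_flight come from the description above. The CubicTimers container, the on_ack_received helper and the millisecond values are hypothetical, and the real patch may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CubicTimers:  # hypothetical container, times in integer milliseconds
    last_sent_time: Optional[int] = None
    last_ack_time: Optional[int] = None  # the new timestamp
    congestion_recovery_start_time: Optional[int] = None


def on_ack_received(state: CubicTimers, now: int) -> None:
    """Record when the most recent acknowledgment was processed."""
    state.last_ack_time = now


def on_packet_sent(state: CubicTimers, now: int, bytes_in_flight: int) -> None:
    """Shift the recovery start only by time the connection truly sat idle."""
    if bytes_in_flight == 0 and state.last_sent_time is not None:
        # Idle began at the later of the last send and the last ACK.
        idle_start = state.last_sent_time
        if state.last_ack_time is not None:
            idle_start = max(idle_start, state.last_ack_time)
        delta = now - idle_start
        if delta > 0 and state.congestion_recovery_start_time is not None:
            state.congestion_recovery_start_time += delta
    state.last_sent_time = now


# Back-to-back round: ACKs land at 14 ms and the next burst leaves at once.
busy = CubicTimers(last_sent_time=0, congestion_recovery_start_time=1)
on_ack_received(busy, now=14)
on_packet_sent(busy, now=14, bytes_in_flight=0)

# Genuine pause: ACKs land at 14 ms, the application sends again at 5014 ms.
paused = CubicTimers(last_sent_time=0, congestion_recovery_start_time=1)
on_ack_received(paused, now=14)
on_packet_sent(paused, now=5014, bytes_in_flight=0)
```

In the back-to-back case the measured idle time is zero, so the recovery start stays at 1 ms and the new burst at 14 ms falls outside recovery. In the paused case the full 5,000 ms gap is still applied, which keeps the original kernel optimization working for genuine pauses.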
One might expect such a subtle timing bug to appear in production dashboards. It did not. High-volume flows rarely collapse all the way to the two-packet floor. The failure mode only manifests after specific patterns of early loss followed by clean recovery conditions. Steady-state throughput graphs stayed green. Static code review missed the interaction. Only rigorous testing under induced loss exposed the flaw.
Cloudflare’s experience echoes broader challenges facing QUIC operators. Just five days after the death-spiral post, the same team published details on udpgrm, a lightweight daemon designed to handle graceful restarts for UDP-based services without dropping packets. Author Marek Majkowski explained how traditional restart techniques fail when QUIC connections carry state across multiple packets. The tool uses SO_REUSEPORT, eBPF and a custom QUIC dissector to migrate flows cleanly. The Cloudflare blog described the work as increasingly necessary now that HTTP/3 traffic demands zero-downtime upgrades at scale.
Industry observers have noted QUIC’s rising prominence. One April 2026 analysis argued the protocol will soon match TCP in operational importance, requiring deeper scrutiny of its independent congestion loops. Each QUIC connection runs its own control algorithm. Losses on one stream do not block others thanks to the design’s avoidance of head-of-line blocking. Yet that independence also means each flow must accurately detect and respond to congestion without kernel assistance. The Register highlighted these differences and the need for careful implementation.
Other recent commentary warns of silent breakage. Because QUIC rides over UDP, middleboxes that only inspect TCP traffic can interfere without warning. Firewalls, proxies and load balancers sometimes drop or throttle UDP in ways invisible to operators until users complain. A May 2026 post detailed how these opaque failures complicate debugging and deployment. Andrew Baker’s technical blog catalogued the extra work QUIC demands in userspace: per-packet encryption, custom loss detection, pacing and congestion state all managed inside the application rather than the operating system.
The death-spiral bug adds another data point. User-space congestion control offers speed of iteration. Engineers can ship new algorithms without waiting for kernel updates. Quiche already supports BBRv3 alongside CUBIC for exactly this reason. But the freedom carries risk. Every borrowed optimization must be re-examined against the precise timing semantics of packetized UDP delivery. What works when the kernel controls transmission scheduling can break when the application does.
Testing strategies must evolve accordingly. The Cloudflare team now emphasizes minimum-congestion-window recovery scenarios in their suite. They instrument with qlog to visualize state transitions that dashboards cannot see. And they treat ported kernel patches with fresh skepticism.
The fix itself shipped quietly. A few lines in the idle calculation and one new timestamp. Full restoration of the test suite. Yet the episode lingers as a reminder. Protocols that move intelligence into endpoints gain flexibility. They also inherit every subtle assumption baked into the systems they replace. Define idle incorrectly and the math turns against you. Measure time from the wrong instant and recovery becomes impossible.
QUIC continues its march into the infrastructure that powers the web. HTTP/3 adoption grows. More services rely on its low-latency, multiplexed streams. The underlying transport must therefore prove not only fast but also predictable under stress. That requires hunting bugs like this one before they surface at global scale.
Cloudflare’s transparent write-up sets a standard. By detailing the exact sequence, the kernel commit that inspired the error, the precise code change and the performance delta before and after, the post gives other implementers a map. Avoid the same trap. Test the corner cases. Question every assumption about when a connection truly sits idle.
The internet’s traffic mix shifts. Loss-based controllers like CUBIC still handle the majority of flows. Their quirks matter. But model-based alternatives gain traction too. The modular design in quiche lets engineers experiment without rewriting the entire stack. That flexibility proved useful here. It will prove useful again as new congestion signals and algorithms emerge.
In the end the bug lasted only because the conditions aligned perfectly. Real loss. Minimum window. Precise round-trip alignment. Minor jitter to keep the spiral spinning. Remove any element and the failure disappears. Production traffic rarely lines up so neatly. That explains why the issue stayed hidden. It also explains why it demanded such careful diagnosis.
Engineers love elegant solutions. Shifting an epoch by the idle interval sounded perfect on paper. It still does for genuine pauses. The adjustment simply needed the correct definition of idle. Last acknowledgment time instead of last send time. A small difference. An enormous impact.
So the next time a test suite starts failing in ways that defy initial explanation, look closely at the timers. Question the idle logic. Check what the kernel did and whether the assumptions still hold. The answer may lie in a commit that fixed TCP years ago but quietly broke QUIC today.


WebProNews is an iEntry Publication