Cloudflare's Multi-Agent Code Review Machine: Scaling AI Oversight Across 130,000 Review Runs

Cloudflare engineers submit merge requests by the thousands each month. Code review bottlenecks used to slow them down. Median wait times stretched into hours. Not anymore.

Ryan Skidmore, a Cloudflare software engineer, detailed how the company built a CI-native system around the open-source OpenCode agent. It triggers automatically on every merge request in GitLab. Up to seven specialized AI agents spring into action, covering security, performance, code quality, documentation, release management, and compliance with internal standards. A coordinator agent oversees them all, deduplicates findings, gauges severity, and posts one clean comment. (Source: Cloudflare Blog.)

From March 10 to April 9, 2026, the system handled 131,246 runs across 48,095 merge requests in 5,169 repositories. Average 2.7 reviews per MR. Median runtime: 3 minutes 39 seconds. Costs averaged $1.19 per review, with a median of $0.98 and P99 at $4.45. It surfaced 159,103 findings—1.2 per review on average. Code quality led with 74,898 flags: 6,460 critical, 29,974 warnings, 38,464 suggestions.

Trivial changes get a light touch. Two agents only. Full reviews hit all seven for big diffs or security-sensitive files. Risk tiers are based on lines changed, file count, and file sensitivity. The code strips noise first: lock files, minified JS, generated code. Boom. Focused analysis.
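The tiering described above can be sketched in a few lines. This is a minimal illustration, not Cloudflare's implementation: the thresholds, the file patterns, the middle "standard" tier, and every name here (`classifyRisk`, `stripNoise`, `NOISE_PATTERNS`, `SENSITIVE_PATTERNS`) are assumptions made for the example.

```typescript
// Hypothetical sketch: the tier names, thresholds, and patterns are illustrative only.
type RiskTier = "trivial" | "standard" | "full";

interface ChangedFile {
  path: string;
  linesChanged: number;
}

// Files treated as noise: lock files, minified JS, generated code (assumed patterns).
const NOISE_PATTERNS: RegExp[] = [
  /(^|\/)(package-lock\.json|yarn\.lock|pnpm-lock\.yaml|Cargo\.lock)$/,
  /\.min\.js$/,
  /(^|\/)(generated|__generated__)\//,
];

// Paths treated as security-sensitive (assumed patterns).
const SENSITIVE_PATTERNS: RegExp[] = [/(^|\/)(auth|crypto|secrets?)\//, /\.pem$/];

// Drop noise files before measuring the size of a change.
function stripNoise(files: ChangedFile[]): ChangedFile[] {
  return files.filter((f) => !NOISE_PATTERNS.some((p) => p.test(f.path)));
}

// Pick a tier from lines changed, file count, and sensitivity.
function classifyRisk(files: ChangedFile[]): RiskTier {
  const relevant = stripNoise(files);
  const lines = relevant.reduce((sum, f) => sum + f.linesChanged, 0);
  const sensitive = relevant.some((f) =>
    SENSITIVE_PATTERNS.some((p) => p.test(f.path)),
  );
  if (sensitive || lines > 500 || relevant.length > 20) return "full";
  if (lines <= 10 && relevant.length <= 2) return "trivial";
  return "standard";
}
```

Filtering before counting matters: a 4,000-line lock file regeneration next to a three-line README fix should still land in the lightest tier.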

Orchestration: Plugins, Prompts, and Model Routing

Flexibility drives the design. A plugin system swaps VCS providers, AI models, even internal rules. The GitLab plugin pulls MR data. The Cloudflare plugin handles AI Gateway routing with failbacks. The Codex plugin enforces engineering standards. Braintrust traces everything. Telemetry tracks it all.

OpenCode runs as child processes via Bun.spawn. JSONL streaming avoids buffering woes. Heartbeats every 30 seconds show that a job is still alive, so users don't cancel it. Prompts stay sharp: “What to Flag” and “What NOT to Flag” sections curb hallucinations. Security agents target exploitable bugs like injections and skip theoretical risks. Models tier up: Claude Opus 4.7 or GPT-5.4 for coordination; Sonnet 4.6 or GPT-5.3 Codex for grunt work; Kimi K2.5 for docs.
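To see why JSONL streaming sidesteps buffering, consider an incremental decoder that accepts arbitrary chunks of a child process's output and emits each event as soon as its line is complete. The sketch below is runtime-agnostic and does not call Bun.spawn itself; the class name `JsonlDecoder` and the event shapes in the usage are hypothetical, and how OpenCode actually frames its events is not specified in the article.

```typescript
// Hypothetical sketch of an incremental JSONL (one JSON value per line) decoder.
class JsonlDecoder {
  // Holds the trailing partial line between chunks.
  private buffer = "";

  // Feed one chunk of text; returns every event whose line is now complete.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is "" if the chunk ended on a newline, else a partial line.
    this.buffer = lines.pop() ?? "";
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        events.push(JSON.parse(trimmed));
      } catch {
        // Skip malformed lines rather than aborting the whole stream.
      }
    }
    return events;
  }
}

// Usage: chunks can split an event anywhere; nothing is emitted until a line completes.
const decoder = new JsonlDecoder();
const first = decoder.push('{"type":"fin');
const second = decoder.push('ding","id":1}\n{"type":"done"}\n');
```

Because each event is a self-contained line, the consumer never needs to hold the full output in memory or wait for the process to exit before reacting.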

The coordinator consolidates. Deduplicates. Recategorizes. Filters nonsense, verifying with tools if needed. Approval logic biases toward yes. All suggestions? Approve with comments. Warnings without risk? Same. Multiple warnings or patterns? Unapprove. Criticals block merges. “Break glass” overrides exist: used 288 times, just 0.6% of merge requests.
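The approval rules above amount to a small decision function. The sketch below is one reading of them, with assumptions labeled: a single warning is treated as "without risk", more than one as a pattern, and a break-glass override is modeled as its own `overridden` outcome. The names `Finding`, `Decision`, and `decide` are hypothetical.

```typescript
// Hypothetical sketch of the approval logic; the thresholds are assumptions.
type Severity = "critical" | "warning" | "suggestion";

interface Finding {
  severity: Severity;
  message: string;
}

type Decision =
  | "approve"
  | "approve_with_comments"
  | "unapprove"
  | "block"
  | "overridden";

function decide(findings: Finding[], breakGlass = false): Decision {
  const criticals = findings.filter((f) => f.severity === "critical").length;
  const warnings = findings.filter((f) => f.severity === "warning").length;

  // Criticals block the merge unless a human has broken the glass.
  if (criticals > 0) return breakGlass ? "overridden" : "block";
  // Assumption: more than one warning counts as "multiple warnings or patterns".
  if (warnings > 1) return "unapprove";
  // Suggestions only, or a lone warning: bias toward yes.
  if (findings.length > 0) return "approve_with_comments";
  return "approve";
}
```

Checking criticals first keeps the bias toward approval from ever masking a blocking issue.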

Re-reviews stay smart. They recall prior findings. Omit issues already fixed. Re-emit those that persist. Respect author resolutions.
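One way to picture the re-review step is as a set comparison between the previous findings and the current ones. The sketch assumes each finding carries a stable fingerprint that survives between runs; that is an assumption of this example, not something the article describes. `PriorFinding`, `ReviewDelta`, and `reconcile` are hypothetical names.

```typescript
// Hypothetical sketch: reconciling a re-review against prior findings.
interface PriorFinding {
  fingerprint: string; // assumed stable identifier for a finding across runs
  resolvedByAuthor: boolean;
}

interface ReviewDelta {
  fixed: string[]; // no longer detected: omit from the new comment
  persistent: string[]; // still detected and unresolved: re-emit
  suppressed: string[]; // still detected but resolved by the author: stay quiet
  fresh: string[]; // detected for the first time in this run
}

function reconcile(prior: PriorFinding[], current: string[]): ReviewDelta {
  const currentSet = new Set(current);
  const priorSet = new Set(prior.map((p) => p.fingerprint));
  const delta: ReviewDelta = { fixed: [], persistent: [], suppressed: [], fresh: [] };

  for (const p of prior) {
    if (!currentSet.has(p.fingerprint)) delta.fixed.push(p.fingerprint);
    else if (p.resolvedByAuthor) delta.suppressed.push(p.fingerprint);
    else delta.persistent.push(p.fingerprint);
  }
  // Anything in the current run that was never seen before is new.
  delta.fresh = Array.from(currentSet).filter((fp) => !priorSet.has(fp));
  return delta;
}
```

Only `persistent` and `fresh` would reach the author in this model, which is what keeps repeated reviews from turning into noise.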

And resilience? Circuit breakers per model tier. Failback chains—like Opus 4.7 to 4.6. Timeouts at 25 minutes total, retries budgeted tight. Workers fetch dynamic configs from KV: disable providers, tweak models on the fly.
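The combination of circuit breakers and failback chains can be sketched as follows. This is a generic version of the pattern, not Cloudflare's code: the failure threshold, the cooldown, keying breakers by model name, and the names `CircuitBreaker` and `callWithFailback` are all assumptions for illustration.

```typescript
// Hypothetical sketch: a simple circuit breaker plus a failback chain.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Closed breakers allow calls; open ones allow a trial only after the cooldown.
  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null;
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}

// Try each model in order, skipping any whose breaker is open.
async function callWithFailback<T>(
  chain: string[],
  breakers: Map<string, CircuitBreaker>,
  call: (model: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no model available");
  for (const model of chain) {
    const breaker = breakers.get(model);
    if (breaker && !breaker.canAttempt()) continue;
    try {
      const result = await call(model);
      breaker?.recordSuccess();
      return result;
    } catch (err) {
      breaker?.recordFailure();
      lastError = err;
    }
  }
  throw lastError;
}
```

A chain such as `["opus-4.7", "opus-4.6"]` (illustrative labels, not real model identifiers) would then fall through to the older model whenever the newer one fails or its breaker is open.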

Integration? Dead simple. Add one line to .gitlab-ci.yml: include: - component: $CI_SERVER_FQDN/ci/ai/opencode@~latest. Repos override via AGENTS.md or Worker configs. Local testing via TUI command.

Broader Stack: 241 Billion Tokens and Rising Adoption

This isn’t isolated. Cloudflare’s full AI engineering stack powers it. Last 30 days: 3,683 users, or 60% of the company and 93% of R&D. 241 billion tokens through AI Gateway. 20 million requests. Every MR gets AI review, which alone accounts for 5.47 million Gateway requests and 24.77 billion tokens. Merge requests doubled to a weekly average of 8,700, peaking at 10,952. (Source: Cloudflare Blog.)

Built on their own stack: AI Gateway routes traffic, and Workers AI infers cheaply, with 77% savings over proprietary models. Dynamic Workers, Agents SDK, Sandbox. Backstage feeds 16,000-entity knowledge graphs. Engineering Codex distills rules for prompts.

Agent Memory layers on top. It ingests chat history, recalls facts across sessions. For code review, it remembers dismissed flags or kept patterns. “The reviewer now remembers that a particular comment wasn’t relevant in a past review,” the post notes. Reviews quiet down. Smarter over time. Shared across agents and teams, so tribal knowledge sticks. (Source: Cloudflare Blog.)

Context from AGENTS.md—auto-generated for 3,900 repos—guides agents on conventions, boundaries. Reviewers update it too. Closed loop.

Industry echoes the push. Anthropic’s Claude Code runs multi-agent PR reviews, boosting meaningful comments from 16% to 54%, per internal tests (InfoQ). But Cloudflare scales it internally first and open-sources pieces, such as its OpenCode contributions (45+ PRs upstream).

Costs add up. Yet savings mount. Workers AI undercuts frontier models. Telemetry exposes inefficiencies. Engineers ship faster, safer. Humans focus high-value work.

Challenges remain. Hallucinations lurk. But scoped prompts, verification, memory curb them. False positives drop. Production risks blocked.

Cloudflare dogfoods hard. Their stack handles real volume. Others watch closely. Copy the plugins? Fork OpenCode? The blueprint’s public.

Scale wins. So does iteration.
