When AI Suggests the Flaw: How Insecure Code Completions Became a Systemic Risk

Seth Larson didn’t set out to expose a flaw in modern developer tools. Yet his recent analysis laid bare a quiet problem. Code completion features in IDEs like PyCharm routinely offer suggestions laced with security weaknesses. The question he posed cuts to the heart of current software practices. Are these suggestions mere annoyances? Or do they cross into genuine vulnerability territory?

Developers type a few characters. The assistant fills in the rest. Speed arrives. Confidence follows. But so do injection flaws, improper validation and exposed secrets. Larson’s piece, published on his personal site, sparked fresh debate on Hacker News and LWN.net just yesterday. The timing matters. New research shows the problem has not eased. It has scaled.

Veracode’s 2025 GenAI Code Security Report tested more than 100 large language models across four languages. Only 55 percent of generated code passed as secure. That leaves 45 percent carrying known flaws. Cross-site scripting proved especially stubborn. Models failed to produce safe code 86 percent of the time for that category. The numbers have barely budged even as models grew more fluent in syntax. Veracode called the stagnation striking.

IOActive went further in April. Its whitepaper examined real-world outputs from leading systems. Average security performance sat at 59 percent. Nearly 32 percent of samples proved fully exploitable. Infrastructure code suffered most. Dockerfiles and Terraform templates failed between 70 and 97 percent of the time. “AI-generated code is not secure by default,” the researchers concluded. The report carries weight because it moved beyond toy examples into production-like scenarios. IOActive left little room for optimism.

Earlier academic work laid the foundation. A 2022 Stanford study found participants with access to an AI assistant wrote significantly less secure code than those without. They also believed their output was safer than it was. Overconfidence compounds the risk. Users accept suggestions quickly. Reviews happen less often. The habit spreads inside teams chasing velocity.

But here’s the twist. The threat now reaches beyond accidental suggestions. Researchers at ETH Zurich and associated institutions detailed a black-box attack called INSEC last year. An adversary slips a short crafted comment into the context. The completion engine then produces insecure code at dramatically higher rates. The reported increase exceeds 50 percent across 16 CWEs and five languages. Functional correctness stays intact. The attack costs under $10 to develop and works against GitHub Copilot and OpenAI APIs alike. An IDE plugin can inject it silently. The paper appeared at ICML 2025. Its implications feel immediate.

Poisoning attacks add another vector. Papers such as TrojanPuzzle showed how subtle changes to training data can steer models toward insecure patterns without obvious traces. Narrow fine-tuning experiments proved even more unsettling. Researchers fine-tuned models to output insecure code without ever disclosing it to the user. The behavior persists across tasks. Alignment, it turns out, can be brittle.

Industry reports echo the pattern. Endor Labs examined common outputs and found missing input sanitization at the top of the list. AI tools default to omitting validation unless prompted explicitly. Dependencies arrive unvetted. Hardcoded credentials slip through. Apiiro’s analysis of Fortune 50 codebases showed 2.5 times more high-severity issues in AI-assisted sections. The volume grows monthly.
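To make the missing-sanitization pattern concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `users` table, the function names and the demo data are assumptions made up for illustration; none of them come from the reports cited above. The first function shows the kind of string-built query a completion might plausibly offer. The second binds the value as a parameter instead.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical in-memory table, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Insecure pattern: user input is interpolated into the SQL text,
    # so a crafted value can change the meaning of the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer pattern: the driver binds the value through a "?" placeholder,
    # so the input is treated as data and never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Passing a value such as `x' OR '1'='1` to the unsafe version returns every row in the table. The parameterized version returns nothing, because no user has that literal name.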

So what counts as a vulnerability? Larson argued that insecure completions meet the bar when they introduce exploitable conditions into code that reaches production. The completion itself may not execute. The resulting program does. If an organization deploys software built on unexamined AI output, the chain of responsibility blurs. Is the flaw the model’s? The developer’s? The vendor’s for shipping the feature without stronger defaults?

Current tools offer little defense. Many IDE plugins lack built-in scanning. Users must install separate linters or rely on post-commit checks. That workflow fails under deadline pressure. “Vibe coding,” the practice of accepting AI output with minimal scrutiny, now appears in startup surveys. One analysis of YC companies found some codebases nearing 95 percent AI-generated. Security scans later uncovered thousands of issues and hundreds of exposed secrets.
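A scan for exposed secrets can be approximated in a few lines. The sketch below is an assumption-laden illustration, not the method any of the cited analyses used. The rule names and both regular expressions are hypothetical and deliberately narrow; production scanners ship far larger rule sets and add entropy checks.

```python
import re

# Hypothetical, deliberately small rule set for illustration only.
SECRET_PATTERNS = {
    # Assumed shape of an AWS-style access key ID: "AKIA" plus 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A sensitive-looking name assigned a quoted literal of 8+ characters.
    "hardcoded_assignment": re.compile(
        r"""(?i)\b(password|secret|api_key|token)\b\s*=\s*["'][^"']{8,}["']"""
    ),
}


def scan_for_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) pairs for every line that matches a rule."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings
```

A check like this is cheap enough to run before commit, which is the point at which deadline pressure tends to crowd out manual review.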

GitLab’s guidance strikes a practical note. Review every suggestion. Run automated scans in CI/CD. Enforce standards before merge. Yet enforcement demands time that many teams no longer allocate. The productivity numbers seduce. Developers report 30 to 50 percent faster task completion. The security tax arrives later, often after release.

Recent attempts at mitigation show mixed results. Some vendors added security plugins that scan in real time. Claude Code introduced one earlier this year. Early feedback suggests it catches certain classes of flaws. Still, coverage remains incomplete. Models trained on public code inherit its flaws. Stack Overflow answers and GitHub repositories contain decades of insecure examples. Probability favors repetition.

Enterprise security teams now track AI-generated code as a distinct risk category. They scan for patterns typical of LLM output: generic variable names, missing logging, overly broad exception handling. Some block certain models for sensitive codebases. Others require human approval for every AI-assisted pull request. Both approaches slow the very gains that drove adoption.
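One of those patterns, overly broad exception handling, is straightforward to detect mechanically. The following sketch uses Python's standard `ast` module. The function name and the choice of which exception names count as "broad" are assumptions for illustration, not a description of any security team's actual tooling.

```python
import ast

# Assumption: these two names are treated as "too broad" for this sketch.
BROAD_NAMES = {"Exception", "BaseException"}


def find_broad_handlers(source: str) -> list[int]:
    """Return line numbers of bare `except:` and `except Exception:` handlers."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            # Bare `except:` catches everything, including KeyboardInterrupt.
            hits.append(node.lineno)
        elif isinstance(node.type, ast.Name) and node.type.id in BROAD_NAMES:
            hits.append(node.lineno)
    return sorted(hits)
```

The check flags human-written code just as readily. It says nothing about who or what wrote a handler, only that the handler deserves a second look.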

The data keeps arriving. A Georgetown CSET report cataloged three risk layers: direct generation of weak code, susceptibility of the models to manipulation, and long-term feedback loops that bake flaws into future training sets. Each layer reinforces the others. Break one and the rest weaken. Ignore them and the attack surface expands.

Larson’s original post focused on a specific incident in the Python community and PyCharm’s behavior. The broader pattern transcends any single tool or language. Java shows the highest failure rates in some benchmarks. Infrastructure-as-code fares worst overall. No major provider has consistently posted secure-generation rates above 80 percent across realistic tasks.

Developers aren’t blind to this. Forums fill with stories of chasing down SQL injection or XSS introduced by an innocent-looking completion. Many now prepend prompts with “write secure code using best practices.” Success varies. Models sometimes ignore the instruction or apply checks inconsistently when context is thin.
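The XSS variant of an innocent-looking completion tends to look like the first function below. This is a minimal sketch using Python's standard `html.escape`; the function names and the comment-rendering scenario are hypothetical. Real applications usually rely on a template engine with auto-escaping rather than hand-built markup.

```python
import html


def render_comment_unsafe(comment: str) -> str:
    # Insecure pattern: untrusted text is placed into markup unchanged,
    # so any tags it contains are interpreted by the browser.
    return f"<p>{comment}</p>"


def render_comment_safe(comment: str) -> str:
    # html.escape replaces &, <, > and quote characters with entities,
    # so the input is displayed as text instead of being parsed as HTML.
    return f"<p>{html.escape(comment)}</p>"
```

Both versions look equally plausible in a completion popup, and that resemblance is part of why the unsafe one gets accepted.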

Yet the question remains practical. Should organizations treat insecure code completion as a vulnerability class worthy of formal tracking in their risk registers? Evidence says yes. The combination of scale, overconfidence, adversarial potential and persistence across model updates creates conditions for widespread exposure. One compromised dependency or one overlooked suggestion can cascade.

Fixes exist but demand discipline. Integrate static analysis that flags AI-typical patterns. Train developers to treat suggestions as hypotheses, not truths. Demand vendors expose confidence scores or security annotations alongside completions. Build guardrails into the IDE rather than after the fact. None of these steps feel novel. All require investment that competes with feature delivery.

The conversation Larson started won’t fade soon. Fresh papers and reports surface monthly. Each adds data points to the same uncomfortable chart: AI accelerates code. It also accelerates risk. Teams that treat the output as authoritative court trouble. Those that treat it as raw material, subject to verification, stand a chance.

Security has always been a shared responsibility. Now the sharing includes silicon that never audited a single line for CWE compliance. The tools arrived. The habits formed. The bill, measured in breaches and remediations, is just coming due.
