Why Most Teams Botch Agent Skills and How a Few Get Them Right

Software engineers at companies racing to adopt AI coding tools often hand their agents generic instructions. Results disappoint. Then they blame the model. The pattern repeats across dev teams from startups to Big Tech.

But one engineer has watched this play out up close. Anson Biggs, a software engineer at Blue Origin, argues the problem lies less with the models and more with how people package knowledge for them. In a detailed post on his notes site, he calls out the common mistakes and lays out a sharper approach. (Anson Biggs notes)

Skills are not just long prompts. They function as structured packages of procedural knowledge. Loaded at the right moment, they give agents targeted expertise without flooding the context window. Done wrong, they add noise or, worse, create the illusion of progress.

The Anti-Pattern That Wastes Everyone’s Time

Developers open a fresh chat with Claude or a similar tool. They ask the model to generate a skill for a task it struggles with. The model complies. The team pastes the output into a markdown file and calls it done. This approach fails for a simple reason: the model lacks the hard-won context that comes from actually wrestling with the problem.

Biggs compares it to asking someone to write the steps for a peanut butter and jelly sandwich without ever making one. The gaps stay hidden until you hit them. A recent benchmark called SkillsBench tested this exact method. Self-generated skills delivered no average benefit. Curated ones lifted pass rates by 16.2 percentage points across tasks. Effects swung wildly by domain. Software engineering saw modest gains. Healthcare posted jumps near 52 points. Sixteen tasks even showed negative impact. (Anson Biggs notes)

The benchmark authors prompted models to generate procedural knowledge before solving. That step, Biggs says, simply reinvented chain-of-thought in a less effective form. Models cannot reliably author the precise knowledge they need to consume. Real value comes from observation after struggle.

The approach that works runs the other way. When an agent gets stuck on a hard problem, the engineer steps in. Once the agent is unstuck, the team captures exactly what was missing. That insight becomes a skill. The difference shows immediately in follow-up runs.

Skills live in a folder structure under .claude/skills. Each gets its own directory. A central SKILL.md file holds metadata and core instructions. Supporting files add scripts, references or troubleshooting notes. One of Biggs’ own examples monitors GitLab CI pipelines. The skill tells the agent to watch jobs until they pass or fail. A companion shell script keeps the agent from improvising sloppy custom code. Reference files cover edge cases. With the skill loaded, even older Claude versions handle the workflow smoothly.
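The layout described above can be sketched as a small scaffolding helper. This is a minimal illustration, not Biggs’ actual tooling: the `scaffold_skill` function, the `scripts` and `references` folder names, and the `name`/`description` front-matter fields are assumptions made for the example. Only the .claude/skills location and the SKILL.md filename come from the description above.

```python
from pathlib import Path


def scaffold_skill(project_root: Path, name: str, description: str, instructions: str) -> Path:
    """Create <project_root>/.claude/skills/<name>/ with a SKILL.md and two support folders."""
    skill_dir = project_root / ".claude" / "skills" / name

    # Hypothetical support folders for helper scripts and reference notes.
    for sub in ("scripts", "references"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)

    # Metadata block first (assumed field names), core instructions after it.
    skill_md = skill_dir / "SKILL.md"
    skill_md.write_text(
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        f"{instructions.strip()}\n",
        encoding="utf-8",
    )
    return skill_md
```

Calling `scaffold_skill(Path("."), "gitlab-ci-monitor", "Watch GitLab CI jobs until they pass or fail.", "...")` would create one directory for a hypothetically named skill. The instructions themselves should still come from what the engineer learned while unsticking the agent, not from a fresh generation.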

Context management drives much of this. Agents start stateless. Every new session forgets prior work. A project-level CLAUDE.md file helps but cannot scale to monorepos with Docker quirks, architecture decisions and testing conventions. Skills fill those gaps for repeated patterns that aren’t universal.

Repetition offers another clear use case. Biggs created a skill that forces alignment across documentation, merge requests, issues and the codebase itself. Instead of repeating the same review instructions, the agent loads the skill and applies consistent standards. Simple. Effective.

Teams that master this report agents handling larger chunks of work. Yet scaling demands more than isolated skills. Biggs later outlined how to break engineering tasks into agent-friendly units. Independent features run in parallel. CI/CD pipelines provide sandboxes and essentially infinite compute. One orchestration agent decides what runs next. Multiple agents tackle separate tickets within sprints. Overbuilt test suites catch the inevitable drift from stateless operation. (Anson Biggs notes)
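One way to picture the orchestration step is as batch planning over a ticket dependency graph. The sketch below is an assumption-laden illustration rather than Biggs’ method: the ticket names and the idea of an explicit dependency map are hypothetical, and the scheduling uses Python’s standard `graphlib.TopologicalSorter`. Tickets in the same batch have no unmet dependencies, so separate agents could work on them in parallel.

```python
from graphlib import TopologicalSorter


def plan_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tickets into batches; every ticket in a batch can run in parallel.

    `dependencies` maps each ticket to the set of tickets it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if tickets depend on each other in a loop

    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tickets whose prerequisites are all done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents unlock
    return batches


# Hypothetical sprint: "api" and "docs" are independent, "ui" needs "api",
# and "e2e-tests" needs both "api" and "ui".
sprint = {
    "api": set(),
    "docs": set(),
    "ui": {"api"},
    "e2e-tests": {"api", "ui"},
}
print(plan_batches(sprint))
```

In a real setup the orchestration agent would react to CI results rather than assume every batch succeeds. The sketch only shows how independence between features translates into parallel work.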

Planning shifts too. Engineers interview the agent using tools like glab for GitLab. They sketch epics, sprints and crawl-walk-run progressions. The human owns vision. The agent owns implementation details. Documentation becomes the single source of truth. Code reads like English. Alignment prompts run across tickets, MR descriptions, docs and implementation to keep everything coherent.

Industry observers see the same themes playing out at larger scale. A New Stack article makes the case that agents and skills must work together. Agents orchestrate workflows, maintain state and enforce boundaries. Skills package reusable expertise from domain specialists who may not write agent logic. Pete Hampton, principal software engineer at ClickHouse, put it plainly. “The future of agentic AI isn’t choosing between agents and skills. It’s agents equipped with the right skills at the right time.” (The New Stack)

ClickHouse built a CLI tool with specialized agents for scanning, migrating and quality assurance on Postgres and TypeScript. When they expanded to MySQL, MongoDB and additional languages such as Go, Java and Python, they added skills rather than rewrite core agent logic. Domain experts contribute modular knowledge. Evaluations stay intact. Context windows stay manageable through progressive loading. Metadata decides relevance first. Full content loads only when needed.
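Progressive loading can be sketched in a few lines. This is a simplified illustration under stated assumptions, not ClickHouse’s implementation: it assumes each skill directory holds a SKILL.md that opens with a `---`-delimited block of `name` and `description` fields, and it uses naive keyword overlap as a stand-in for whatever relevance logic a real agent applies. All function and class names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillMeta:
    name: str
    description: str
    path: Path  # location of SKILL.md; the body is loaded only on demand


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse simple 'key: value' lines between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def index_skills(skills_root: Path) -> list[SkillMeta]:
    """Keep only lightweight metadata for every skill under skills_root."""
    index = []
    for skill_file in sorted(skills_root.glob("*/SKILL.md")):
        meta = parse_frontmatter(skill_file.read_text(encoding="utf-8"))
        index.append(
            SkillMeta(
                name=meta.get("name", skill_file.parent.name),
                description=meta.get("description", ""),
                path=skill_file,
            )
        )
    return index


def select_skills(index: list[SkillMeta], task: str) -> list[SkillMeta]:
    """Naive relevance check: a description word longer than 3 letters appears in the task."""
    task_words = set(task.lower().split())
    return [
        skill
        for skill in index
        if task_words & {w for w in skill.description.lower().split() if len(w) > 3}
    ]


def load_body(skill: SkillMeta) -> str:
    """Return the full instructions, minus the metadata block, for a selected skill."""
    text = skill.path.read_text(encoding="utf-8")
    if text.startswith("---"):
        parts = text.split("---", 2)
        if len(parts) == 3:
            return parts[2].strip()
    return text.strip()
```

Only the short descriptions from `index_skills` would sit in the agent’s context by default. `load_body` runs just for the skills that `select_skills` returns, which is what keeps the context window manageable as the number of skills grows.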

This modular thinking addresses a core limitation. Pure prompt engineering leads to bloated system prompts, unexpected regressions and brittle specialization. Skills allow iterative updates, versioning and isolated testing. The separation of concerns (orchestration in agents, knowledge in skills) creates systems that actually ship to production.

Recent data backs the momentum. A Firecrawl analysis of 2026 trends highlights CLI agents gaining ground in development. According to reports it cites, engineers ship code 30% faster and have saved roughly 500,000 hours. Anthropic’s own 2026 Agentic Coding Trends Report notes engineers produce more output per interaction and save about 40 minutes each. One Claude Code example completed a seven-hour job on a 12.5 million line codebase with 99.9% accuracy. (Firecrawl blog)

Multi-agent teams show 50% faster candidate screening in some deployments. Vertical agents deliver 40% gains in narrow domains. Browser-based agents grew 45% year over year. Live web data cuts hallucinations by 35%. Yet governance questions mount. Surveys show nearly every organization has agentic AI on its 2026 roadmap, but only a small fraction deploy multiple solutions in production. Trust, auditability and outcome focus remain hurdles. (CIO.com)

So the pattern holds. Teams that treat skills as afterthoughts or one-shot generations see modest or negative returns. Those who treat them as living artifacts, refined after real friction, compound their agents’ effectiveness. They capture novel solutions from hard problems. They eliminate repetition on routine patterns. They document gaps the fresh model could never guess.

Apple’s WWDC 2026 moves and Microsoft’s MAI models signal the shift continues. Xcode gains deeper agent integration. Coding models reach more users. The question is no longer whether agents will handle more work. It is whether teams will give them the precise, battle-tested knowledge they need to succeed. A few already do. The rest keep pasting fresh generations into markdown files and wondering why nothing improves.

Skills done right don’t replace engineering judgment. They amplify it. They turn one-off insights into repeatable advantage. And in a year when every roadmap lists agentic AI, that distinction may separate the leaders from the experimenters.
