When Skills Hurt: Negative Result for CTF Agents

Summary

This paper presents a sobering counterpoint to the prevailing narrative that adding curated procedural knowledge (“Skills”) to LLM agents always improves performance. The authors re-analyze a controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions (55, 1,478, 1,976, and 4,147 lines), which map almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation.

The key finding is stark: the spread between the no-Skills and full-Skills conditions was only 8.9 percentage points, a difference that was not statistically significant across multiple tests (p = 0.71 for χ²; p = 0.25 for the Cochran–Armitage trend test). Five of six pairwise Cohen’s h values fell below the 0.2 small-effect threshold. In one timing side-channel setting, Skills actively degraded performance.
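To make the 0.2 threshold concrete, here is a minimal sketch of how Cohen’s h is computed for two proportions. The function names and the two solve rates are illustrative assumptions: the rates were picked only so that the gap is 8.9 percentage points, and they are not the per-condition figures from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two proportions after an arcsine transform."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"proportion out of range: {p}")
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


def effect_label(h: float) -> str:
    """Cohen's conventional thresholds: 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(h)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical solve rates, chosen only so the gap is 8.9 percentage points.
# They are NOT the per-condition rates reported in the paper.
baseline_rate = 0.511
skills_rate = 0.600

h = cohens_h(skills_rate, baseline_rate)
print(f"h = {h:.3f} ({effect_label(h)})")
```

Because of the arcsine transform, the same percentage-point gap yields a larger h near 0 or 1 than near 0.5, so whether a given 8.9-point spread stays under 0.2 depends on where the underlying rates sit.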

The authors propose a novel explanatory variable: environment-feedback bandwidth. When an agent’s tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. This challenges the default assumption in the agent ecosystem that more procedural knowledge is always better.

Key Contributions

Negative result with statistical rigor: Demonstrates that in a high-feedback domain (offensive cybersecurity CTF), adding curated Skills yields no statistically significant improvement over a no-Skills baseline.
Introduction of “environment-feedback bandwidth”: A falsifiable variable that predicts when Skills help vs. hurt, grounded in the observation that strict, low-latency tool feedback substitutes for procedural knowledge.
Reanalysis pipeline: The authors will release their reanalysis pipeline to support replication, enabling the community to test the hypothesis in other domains.
Challenge to prevailing assumption: Directly questions the widely reported 16.2 pp average improvement from Skills, noting that 16 of 84 tasks in those benchmarks suffered negative deltas.

Implications

For Researchers

This paper provides a clean, falsifiable hypothesis for when Skills help and when they are redundant. Researchers should incorporate “environment-feedback bandwidth” as a control variable in future agent benchmarking studies. The finding that Skills can actively degrade performance in certain settings (e.g., timing side-channels) opens a new line of inquiry into the interaction between tool design, procedural knowledge, and task characteristics.

For Developers

For developers building MCP-based agents, this paper suggests a shift in investment strategy. Instead of defaulting to curating large Skill libraries, prioritize designing tools that return strict, schema-validated, low-latency observations. This may simplify agent architectures, reduce maintenance overhead, and in some cases improve performance. The paper also warns that adding Skills without considering the feedback bandwidth of the environment can backfire.
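As a rough illustration of what “strict, schema-validated observations” could look like in practice, the sketch below wraps a tool so that malformed calls come back as structured, actionable errors instead of silent failures. Everything here is an assumption made for illustration: `Observation`, `validate_args`, `strict_tool`, and `fake_port_probe` are hypothetical names, the schema is a simple name-to-type mapping rather than a full JSON Schema, and none of it is taken from the paper or from any MCP SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Observation:
    """Structured result handed back to the agent after a tool call."""
    ok: bool
    data: dict[str, Any] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


def validate_args(args: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means the call is valid."""
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing required argument '{name}'")
        elif not isinstance(args[name], expected):
            problems.append(
                f"argument '{name}' must be {expected.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    for name in args:
        if name not in schema:
            problems.append(f"unknown argument '{name}'")
    return problems


def strict_tool(
    schema: dict[str, type],
    handler: Callable[..., dict[str, Any]],
) -> Callable[[dict[str, Any]], Observation]:
    """Wrap a handler so every call is validated before it runs."""
    def call(args: dict[str, Any]) -> Observation:
        problems = validate_args(args, schema)
        if problems:
            # The error text itself is the correction signal for the agent.
            return Observation(ok=False, errors=problems)
        return Observation(ok=True, data=handler(**args))
    return call


# Hypothetical tool: a stand-in for a port probe, with no real network access.
def fake_port_probe(host: str, port: int) -> dict[str, Any]:
    return {"host": host, "port": port, "state": "closed"}


probe = strict_tool({"host": str, "port": int}, fake_port_probe)

bad = probe({"host": "10.0.0.5", "port": "80"})   # port passed as a string
good = probe({"host": "10.0.0.5", "port": 80})
print(bad.errors)
print(good.data)
```

The design point is that the rejected call names the offending argument and the expected type, so the agent can repair its next call from the observation alone rather than from a documented procedure.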

For Users

End users of AI agents, especially in cybersecurity, DevOps, and data engineering, should be aware that “more is not always better.” An agent that relies on rich tool feedback may be simpler, faster, and more reliable than one burdened with extensive procedural knowledge. Users should also be cautious in edge cases (e.g., timing-sensitive tasks), where Skills can introduce noise.

References

https://arxiv.org/abs/2605.20023v1
