T To Play Claw Browse tools
Back to Signals
arXiv · analysis signal

MCP Poisoning Attacks: When Tool Manuals Lie to AI Agents

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

Signal thesis

The MCP ecosystem's trust in tool metadata creates a critical vulnerability that requires fundamental rethinking of agent security architecture.

Why it matters

For To Play Claw users building MCP-based agent systems, this research reveals that the very metadata protocols enabling tool interoperability also create a covert attack surface. The finding that standard prompt-guardrails can worsen security (Firewall Fallacy) means current best practices may be actively harmful, requiring immediate adoption of reactive self-correction mechanisms.

Original source

https://arxiv.org/abs/2605.24069v1

Key takeaways

Read this first.

  1. Tool Description Poisoning exploits the agent's reliance on metadata for planning, not executable code—bypassing traditional security checks
  2. Current prompt-guardrail defenses are not just ineffective but can be counterproductive, creating a false sense of security
  3. Reactive Self-Correction—where agents autonomously detect and revert malicious actions post-execution—shows promise as a defense
Ecosystem impact

Where this changes the map.

For Researchers

Opens a new research direction in agent security: cognitive-layer attacks that exploit metadata trust. The Firewall Fallacy phenomenon demands investigation into why and how guardrails backfire, and Reactive Self-Correction needs validation across more agent architectures.

For Developers

Immediate action required: audit all MCP tool descriptions for injection vulnerabilities, implement post-execution monitoring and rollback capabilities, and avoid over-reliance on prompt-based guardrails. Consider adding metadata integrity checks and tool description sanitization pipelines.

For Users

Be aware that AI agents using MCP tools can be manipulated through seemingly benign tool descriptions. Until defenses mature, exercise caution when deploying agents in high-stakes environments, and prefer agents with built-in self-correction and audit logging capabilities.

Full English translation

Translated text.

Summary

The rise of tool-using LLM agents, standardized by protocols like MCP, has unlocked unprecedented autonomous capabilities—but also introduced a covert attack surface targeting the agent’s cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack where malicious instructions are covertly injected into a tool’s descriptive metadata—the very “manual” an agent relies on for planning and decision-making.

The researchers introduce the MCP-TDP Security Benchmark, a high-fidelity sandbox environment comprising 32 realistic test cases spanning 6 distinct risk categories. Their evaluation of 8 mainstream LLMs reveals severe vulnerabilities: leading models like GPT-4o exhibit a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Critically, common prompt-guardrail defenses are largely ineffective and can be counterproductive—a phenomenon termed the “Firewall Fallacy.”

Key Contributions

  • Systematic definition and formalization of Tool Description Poisoning (TDP) as a distinct attack class targeting agent metadata
  • MCP-TDP Security Benchmark: 32 realistic test cases across 6 risk categories (data exfiltration, privilege escalation, denial of service, etc.)
  • Empirical vulnerability assessment of 8 mainstream LLMs, showing near-100% ASR on GPT-4o
  • Identification of the Firewall Fallacy: prompt-guardrail defenses can worsen security outcomes
  • Proposal of Reactive Self-Correction: a novel defense mechanism where agents autonomously detect and revert malicious actions post-execution

Implications

For Researchers

This work establishes a new attack surface in agent security: cognitive-layer attacks that exploit metadata trust rather than code vulnerabilities. The Firewall Fallacy phenomenon demands urgent investigation into why prompt-based defenses backfire, potentially due to adversarial perturbations in the agent’s reasoning chain. Reactive Self-Correction opens a promising research direction in post-hoc security mechanisms for autonomous agents.

For Developers

Immediate action items: audit all MCP tool descriptions for injection vulnerabilities, implement post-execution monitoring and rollback capabilities, and avoid over-reliance on prompt-based guardrails. Consider adding metadata integrity checks, tool description sanitization pipelines, and runtime anomaly detection. The finding that guardrails can worsen security means current best practices may be actively harmful.

For Users

Until defenses mature, exercise caution when deploying MCP-based agents in high-stakes environments. Prefer agents with built-in self-correction and audit logging capabilities. Be aware that seemingly benign tool descriptions can manipulate agent behavior—treat tool metadata as untrusted input, similar to user prompts.

References

What to watch next

Follow-up signals.

  • Development of metadata sanitization standards for MCP tool descriptions
  • Adoption of Reactive Self-Correction as a standard agent architecture pattern
  • Emergence of adversarial training datasets for tool description robustness
Source and permission

Trace the origin.

Original title
When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents
Source
arXiv
Author
Shi Liu
Original date
2026-05-22
Permission
open_license
Published
2026-05-30
Source URL
https://arxiv.org/abs/2605.24069v1