MCP Poisoning Attacks: When Tool Manuals Lie to AI Agents

Summary

The rise of tool-using LLM agents, standardized by protocols like MCP, has unlocked unprecedented autonomous capabilities—but also introduced a covert attack surface targeting the agent’s cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack where malicious instructions are covertly injected into a tool’s descriptive metadata—the very “manual” an agent relies on for planning and decision-making.

The researchers introduce the MCP-TDP Security Benchmark, a high-fidelity sandbox environment comprising 32 realistic test cases spanning 6 distinct risk categories. Their evaluation of 8 mainstream LLMs reveals severe vulnerabilities: leading models like GPT-4o exhibit a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Critically, common prompt-guardrail defenses are largely ineffective and can be counterproductive—a phenomenon termed the “Firewall Fallacy.”

Key Contributions

Systematic definition and formalization of Tool Description Poisoning (TDP) as a distinct attack class targeting agent metadata
MCP-TDP Security Benchmark: 32 realistic test cases across 6 risk categories (data exfiltration, privilege escalation, denial of service, etc.)
Empirical vulnerability assessment of 8 mainstream LLMs, showing near-100% ASR on GPT-4o
Identification of the Firewall Fallacy: prompt-guardrail defenses can worsen security outcomes
Proposal of Reactive Self-Correction: a novel defense mechanism where agents autonomously detect and revert malicious actions post-execution

Implications

For Researchers

This work establishes a new attack surface in agent security: cognitive-layer attacks that exploit metadata trust rather than code vulnerabilities. The Firewall Fallacy phenomenon demands urgent investigation into why prompt-based defenses backfire, potentially due to adversarial perturbations in the agent’s reasoning chain. Reactive Self-Correction opens a promising research direction in post-hoc security mechanisms for autonomous agents.

For Developers

Immediate action items: audit all MCP tool descriptions for injection vulnerabilities, implement post-execution monitoring and rollback capabilities, and avoid over-reliance on prompt-based guardrails. Consider adding metadata integrity checks, tool description sanitization pipelines, and runtime anomaly detection. The finding that guardrails can worsen security means current best practices may be actively harmful.

For Users

Until defenses mature, exercise caution when deploying MCP-based agents in high-stakes environments. Prefer agents with built-in self-correction and audit logging capabilities. Be aware that seemingly benign tool descriptions can manipulate agent behavior—treat tool metadata as untrusted input, similar to user prompts.

References

https://arxiv.org/abs/2605.24069v1

MCP Poisoning Attacks: When Tool Manuals Lie to AI Agents

Read this first.

Where this changes the map.

Translated text.