Trustworthy Agentic AI: Safety, Privacy & Security Survey

Summary

This survey, from researchers at The Chinese University of Hong Kong and Southern University of Science and Technology, presents what its authors describe as the first comprehensive mapping of trustworthiness risks specific to agentic AI systems, meaning LLMs augmented with planning, tool use, memory, and long-horizon interaction. Unlike traditional LLM trustworthiness surveys, it focuses on failure modes that emerge only across multi-step agent trajectories, such as tool misuse cascades, memory poisoning across sessions, and adversarial attacks on planning components.

The authors organize their analysis around two core dimensions critical for high-risk deployments: Safety and Robustness (covering adversarial robustness, out-of-distribution generalization, and value alignment) and Privacy and System Security (covering data leakage, model extraction, and system-level vulnerabilities). For each dimension, they clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. The paper also consolidates evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals, and provides scenario-to-metric guidance for release gating.

Key Contributions

First systematic mapping of trustworthiness risks to specific stages of the agent workflow (planning, tool use, memory, interaction)
Stage-targeted mitigation strategies for safety, robustness, privacy, and system security
Unified metrics-and-benchmarks hub with guidance on selecting appropriate metrics for different deployment scenarios
Case study of real-world security failures in open-source agentic systems
Identification of open challenges including self-evolving agents, runtime monitoring, privacy-preserving personalization, and the trust-utility trade-off
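
The stage-level mapping in the first contribution can be pictured as a simple lookup table. The sketch below is illustrative only: the four stage names and the three example risks are taken from this summary, but the assignment of each risk to a stage and the `risks_for` helper are assumptions made for illustration, not the survey's own taxonomy.

```python
from enum import Enum


class Stage(Enum):
    """Agent workflow stages named in the summary."""
    PLANNING = "planning"
    TOOL_USE = "tool use"
    MEMORY = "memory"
    INTERACTION = "interaction"


# Hypothetical assignment of example risks to stages (not the survey's table).
STAGE_RISKS = {
    Stage.PLANNING: ["adversarial attacks on planning components"],
    Stage.TOOL_USE: ["tool misuse cascades"],
    Stage.MEMORY: ["memory poisoning across sessions"],
    Stage.INTERACTION: [],  # left empty: the summary names no example here
}


def risks_for(stages):
    """Return the de-duplicated risks for the stages an agent actually uses."""
    seen = []
    for stage in stages:
        for risk in STAGE_RISKS[stage]:
            if risk not in seen:
                seen.append(risk)
    return seen
```

An agent that uses tools and persistent memory but no explicit planner would call `risks_for([Stage.TOOL_USE, Stage.MEMORY])` to list the risks worth reviewing first.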

Implications

For Researchers

This survey provides a structured taxonomy that can guide future research agendas. The identification of open challenges—particularly self-evolving agents and runtime monitoring—highlights areas where current solutions are insufficient. The unified metrics hub also provides a foundation for developing standardized benchmarks that measure process-level trustworthiness signals, not just task completion.

For Developers

The stage-targeted mitigation strategies offer immediate practical value. Developers can use the risk mapping to identify where their agent systems are most vulnerable and apply appropriate countermeasures. The scenario-to-metric guidance for release gating provides a framework for making deployment decisions based on trustworthiness signals.
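
To illustrate what scenario-to-metric release gating might look like in practice, here is a minimal sketch. Everything specific in it is an assumption: the scenario names, the metric names (`task_success_rate`, `unsafe_tool_call_rate`, `private_data_leak_rate`), and the threshold values are hypothetical placeholders, not figures or metrics taken from the survey. The only idea borrowed from the text is that a gate can combine an outcome signal (task success) with process signals (what happened along the trajectory).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


# Hypothetical scenario-to-metric table; all names and limits are placeholders.
GATES = {
    "internal_demo": [Threshold("task_success_rate", 0.70, True)],
    "high_stakes": [
        Threshold("task_success_rate", 0.90, True),       # outcome signal
        Threshold("unsafe_tool_call_rate", 0.01, False),  # process signal
        Threshold("private_data_leak_rate", 0.0, False),  # process signal
    ],
}


def release_gate(scenario, measured):
    """Return (passed, failures). A metric that was not measured counts as a failure."""
    failures = []
    for t in GATES[scenario]:
        value = measured.get(t.metric)
        if value is None:
            failures.append(f"{t.metric}: not measured")
            continue
        ok = value >= t.limit if t.higher_is_better else value <= t.limit
        if not ok:
            failures.append(f"{t.metric}: {value} vs limit {t.limit}")
    return (not failures, failures)


if __name__ == "__main__":
    results = {
        "task_success_rate": 0.95,
        "unsafe_tool_call_rate": 0.05,
        "private_data_leak_rate": 0.0,
    }
    print(release_gate("high_stakes", results))
```

Treating an unmeasured metric as a failure is a deliberate choice in this sketch: a gate that silently passes when a process signal is missing would reward skipping the evaluation.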

For Users

The survey’s emphasis on the trust-utility trade-off and transparency mechanisms is directly relevant to end users. Understanding that agentic systems involve inherent trade-offs between capability and trustworthiness empowers users to make informed decisions about adoption, particularly in high-stakes environments.

References

https://arxiv.org/abs/2605.23989v1
