Beyond Final Answers: Auditing Hidden Failures in Multi-Agent Workflows

Summary

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. This paper presents Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows.

The authors introduce a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) and apply it to expert-annotated agent traces from AssetOpsBench. Their results show that existing benchmarks miss the most common failure modes, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, which the authors take as evidence that taxonomy-grounded evaluation is needed for safer agentic deployment.

Key Contributions

Trajel dataset: Expert-annotated industrial agent traces from AssetOpsBench with trajectory-level hallucination labels
Five-type hallucination taxonomy: Factual, referential, logical, procedural, and scope-based categories for intermediate failures
Benchmarking results: Supervised detection models evaluated at subtask, trajectory, and long-context levels
Key finding: Nearly half of hallucinated trajectories contain multiple hallucination types simultaneously
Detection insight: Trajectory-aware detection significantly outperforms standard post-hoc verification
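
As a rough illustration of how trajectory-level labels of this kind might be represented, the Python sketch below defines the five taxonomy types as an enum and attaches label sets to individual Thought-Action-Observation steps. The class names (HallucinationType, Step, Trajectory), the field names, and the multi_type_share helper are hypothetical and are not taken from the Trajel dataset's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class HallucinationType(Enum):
    # The five categories named in the taxonomy.
    FACTUAL = "factual"
    REFERENTIAL = "referential"
    LOGICAL = "logical"
    PROCEDURAL = "procedural"
    SCOPE = "scope"


@dataclass
class Step:
    # One Thought-Action-Observation step; field names are illustrative.
    thought: str
    action: str
    observation: str
    labels: set[HallucinationType] = field(default_factory=set)


@dataclass
class Trajectory:
    steps: list[Step]

    def hallucination_types(self) -> set[HallucinationType]:
        # Union of the labels over all intermediate steps.
        found: set[HallucinationType] = set()
        for step in self.steps:
            found |= step.labels
        return found

    def is_hallucinated(self) -> bool:
        return bool(self.hallucination_types())

    def is_multi_type(self) -> bool:
        return len(self.hallucination_types()) > 1


def multi_type_share(trajectories: list[Trajectory]) -> float:
    # Fraction of hallucinated trajectories carrying more than one type.
    hallucinated = [t for t in trajectories if t.is_hallucinated()]
    if not hallucinated:
        return 0.0
    return sum(t.is_multi_type() for t in hallucinated) / len(hallucinated)
```

Keeping labels on steps rather than on the whole trace is one possible design choice; it lets the same structure answer both "is this trajectory hallucinated?" and "where did it go wrong?".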

Implications

For Researchers

This work shifts the evaluation paradigm from output-only to process-aware auditing. The Trajel dataset and taxonomy provide a rigorous foundation for studying how reasoning errors compound across multi-step agent workflows. Researchers can now benchmark models on intermediate reasoning quality, not just final answer correctness.

For Developers

Developers building multi-agent systems should integrate trajectory-level verification rather than relying on post-hoc checks. The five-type taxonomy offers a practical framework for debugging agent failures in production. Tools that monitor intermediate Thought-Action-Observation steps will be essential for safe deployment.
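
To make the contrast with post-hoc checking concrete, here is a minimal sketch of a step-by-step audit loop. Everything in it is an assumption for illustration: the TraceStep type, the audit_trajectory function, the asset-ID naming scheme, and the toy unknown_asset_check heuristic are invented here and are not detectors or APIs from the paper.

```python
from typing import Callable, NamedTuple, Optional


class TraceStep(NamedTuple):
    thought: str
    action: str
    observation: str


# A check inspects one step plus the steps before it and returns a
# taxonomy label such as "referential", or None if nothing looks wrong.
Check = Callable[[TraceStep, list[TraceStep]], Optional[str]]


def unknown_asset_check(step: TraceStep, history: list[TraceStep]) -> Optional[str]:
    # Toy heuristic (an assumption, not a detector from the paper): flag an
    # action that names an asset ID never seen in an earlier observation.
    seen = {tok for prev in history for tok in prev.observation.split()}
    for token in step.action.split():
        if token.startswith("asset-") and token not in seen:
            return "referential"
    return None


def audit_trajectory(steps: list[TraceStep], checks: list[Check]) -> list[tuple[int, str]]:
    # Run every check on every step, keeping the step index of each finding.
    findings = []
    for i, step in enumerate(steps):
        for check in checks:
            label = check(step, steps[:i])
            if label is not None:
                findings.append((i, label))
    return findings


# Hypothetical trace: the final step looks fine, but step 1 acts on an
# asset that no earlier observation ever returned.
trace = [
    TraceStep("List the chillers", "list_assets site-1", "asset-7 asset-9"),
    TraceStep("Check the third chiller", "get_sensor asset-12", "temp=41"),
    TraceStep("Report", "finish", "done"),
]
print(audit_trajectory(trace, [unknown_asset_check]))  # [(1, 'referential')]
```

A check that looked only at the last step would see nothing unusual here; the finding comes from comparing an intermediate action against the observations that preceded it.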

For Users

End users of agentic systems, especially in industrial and enterprise contexts, stand to benefit from safer, more reliable deployments. Trajectory-level auditing can catch failures that might otherwise lead to incorrect decisions, financial losses, or safety issues in automated workflows.

References

https://arxiv.org/abs/2605.24219v2
