Contractual Skills: Making Enterprise AI Agents Governable

Summary

As enterprises deploy AI agents for increasingly complex tasks, they need governance mechanisms that are both lightweight and inspectable. Ting Liu’s paper introduces “contractual skills,” a design framework inspired by GovernSpec that extends the common SKILL.md pattern with explicit fields for goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules.
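As a rough illustration, the nine fields listed above could be modeled as a simple record. This is a minimal sketch, not the paper's schema: the `ContractualSkill` class, its attribute names, and the `missing_fields` helper are hypothetical names chosen here for illustration, and the example values are made up.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ContractualSkill:
    """Hypothetical record mirroring the nine contractual fields named in the summary."""
    goal: str = ""
    input_boundaries: list[str] = field(default_factory=list)
    permissions: list[str] = field(default_factory=list)
    evidence_requirements: list[str] = field(default_factory=list)
    output_contract: str = ""
    quality_criteria: list[str] = field(default_factory=list)
    verification_steps: list[str] = field(default_factory=list)
    human_approval_points: list[str] = field(default_factory=list)
    handoff_rules: list[str] = field(default_factory=list)


def missing_fields(skill: ContractualSkill) -> list[str]:
    """Return the names of fields left empty, in declaration order."""
    return [f.name for f in fields(skill) if not getattr(skill, f.name)]


# Illustrative values only; they are not taken from the paper.
skill = ContractualSkill(
    goal="Summarize a vendor contract for legal review",
    input_boundaries=["Only documents supplied in the request"],
    permissions=["read:contracts"],
)
print(missing_fields(skill))
```

A reviewer could run a check like this before a skill is published, which is one way to make the "inspectable" property concrete: an empty field is visible rather than implied.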

The paper’s key insight is that contractual skills should serve as a governance layer rather than a performance optimization. Liu reports two offline experiments: a text-generation study with 960 outputs across 8 models and a tool-calling challenge with 192 simulated records. In both, contractual skills consistently outperform the no-skill and minimal-skill baselines. Compared with information-rich plain expanded skills, however, the gains are small and mixed, which suggests that the primary value lies in making task intent and boundaries explicit rather than in improving raw generation quality.

The tool-calling results are particularly instructive: while contractual skills reduced high-risk tool attempts, model-level differences persisted, and runtime tool guardrails remained necessary. This reinforces the paper’s central thesis that contractual skills are best understood as a complement to, not a replacement for, runtime safety mechanisms.
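To make the distinction between skill text and runtime enforcement concrete, here is a minimal sketch of a runtime guardrail. It is an assumption-laden illustration, not the paper's implementation: the `ToolCall` type, the `guard` function, the tool names, and the `HIGH_RISK_TOOLS` set are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical risk tier; a real deployment would define its own policy.
HIGH_RISK_TOOLS = {"delete_records", "send_payment"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    approved_by_human: bool = False


def guard(call: ToolCall, skill_permissions: set[str]) -> str:
    """Decide at runtime whether a tool call may proceed.

    The check runs regardless of what the skill text told the model,
    so a model that ignores its instructions is still blocked.
    """
    if call.tool not in skill_permissions:
        return "deny"
    if call.tool in HIGH_RISK_TOOLS and not call.approved_by_human:
        return "needs_approval"
    return "allow"


permissions = {"search_docs", "send_payment"}
print(guard(ToolCall("search_docs"), permissions))     # allow
print(guard(ToolCall("send_payment"), permissions))    # needs_approval
print(guard(ToolCall("delete_records"), permissions))  # deny
```

The skill can state the same permissions in prose, but only the guard enforces them; that separation is the sense in which the two are complementary.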

Key Contributions

Contractual Skills Framework: A GovernSpec-inspired design pattern for SKILL.md files that makes goals, boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules inspectable
Architectural Clarification: Clear delineation between contractual skills, GovernSpec YAML contracts, Model Context Protocol surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems
Empirical Validation: Two offline experiments with 960 text-generation outputs and 192 tool-call records across 8 models, with 1680 cross-judge score records
Baseline Comparisons: Systematic comparison across four instruction conditions (no-skill, minimal-skill, contractual-skill, expanded-skill) showing where contractual skills add value and where they don’t
Safety Analysis: Evidence that contractual skills reduce high-risk tool attempts but cannot replace runtime guardrails

Implications

For Researchers

This paper provides a reproducible experimental framework for studying skill governance in enterprise AI agents. Its dataset of 960 text-generation outputs and 1680 cross-judge score records can serve as a reference point for future work on skill structure and agent behavior. Researchers should explore how contractual fields interact with different model architectures and reasoning strategies, and investigate whether certain contractual fields (e.g., evidence requirements vs. quality criteria) have a disproportionate impact on agent behavior.

For Developers

The framework offers a practical, immediately applicable pattern for structuring SKILL.md files. Developers can adopt contractual skills incrementally—starting with goal and boundary fields, then adding verification steps and approval points as needed. The paper’s architectural clarifications help avoid common confusions between skills, YAML contracts, MCP surfaces, and runtime guardrails, enabling cleaner system designs.
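Incremental adoption might look like the following sketch, which renders a SKILL.md body from whichever fields have been filled in so far. The `render_skill` function, the section titles, and the example skill are hypothetical; the paper's actual SKILL.md headings may differ.

```python
# Hypothetical section titles; the paper's actual SKILL.md headings may differ.
SECTION_ORDER = [
    ("goal", "Goal"),
    ("input_boundaries", "Input Boundaries"),
    ("verification_steps", "Verification Steps"),
    ("human_approval_points", "Human Approval Points"),
]


def render_skill(name: str, sections: dict[str, list[str]]) -> str:
    """Render only the sections that have been filled in so far."""
    lines = [f"# {name}"]
    for key, title in SECTION_ORDER:
        items = sections.get(key)
        if not items:
            continue  # section not adopted yet
        lines.append("")
        lines.append(f"## {title}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines) + "\n"


# First pass: goal and boundary fields only.
first_pass = {
    "goal": ["Draft a refund decision for a support ticket"],
    "input_boundaries": ["Use only the ticket text and order history"],
}
# Second pass: add a verification step without touching earlier fields.
second_pass = {
    **first_pass,
    "verification_steps": ["Check the refund amount against the order total"],
}

print(render_skill("refund-triage", first_pass))
print(render_skill("refund-triage", second_pass))
```

Because unfilled sections are skipped rather than emitted empty, a team can start with two fields and grow the file over time without reformatting what is already there.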

For Users

Enterprise users gain confidence that agent skills have explicit, auditable boundaries and acceptance criteria. The framework enables better human oversight through defined approval points and handoff rules, making it easier to integrate AI agents into regulated workflows. However, users should understand that contractual skills are a governance layer, not a safety guarantee—runtime monitoring remains essential.

References

https://arxiv.org/abs/2605.22634v1
