The Tool-Calling Training Gap: FireFly and EnvFactory Attack the Bottleneck

The Bottleneck

Training an LLM to call tools reliably requires large volumes of trajectory data: sequences of agent actions, tool calls, and outcomes. Generating this data at scale runs into two problems. Synthetic environments don’t match real API behavior, and tasks generated without ground-truth outcomes can’t be verified.

FireFly: Verified Data from Real APIs

Yuxuan Lu et al. present FireFly, a pipeline that generates verified tool-call data directly from real APIs. Unlike approaches that simulate tool environments, FireFly connects to actual APIs, executes tool calls, and records the outcomes. Each trajectory therefore carries a verifiable label: the tool call either succeeded or failed against the real API.
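To make the idea of an execution-derived label concrete, here is a minimal Python sketch. It is not FireFly's code. The record layout, the `execute_and_record` helper, and the `get_weather` function (a local stand-in for a real API client) are all hypothetical names chosen for illustration. The point is only that the success/failure label comes from actually running the call, not from a simulator's guess.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Callable, Dict, Optional


@dataclass
class ToolCallRecord:
    """One executed tool call plus the label derived from running it."""
    tool: str
    arguments: Dict[str, Any]
    succeeded: bool
    result: Optional[Any] = None
    error: Optional[str] = None


def execute_and_record(
    tool_name: str, tool_fn: Callable[..., Any], arguments: Dict[str, Any]
) -> ToolCallRecord:
    """Run the call and label it by what actually happened."""
    try:
        result = tool_fn(**arguments)
    except Exception as exc:  # any failure becomes a negative label
        message = f"{type(exc).__name__}: {exc}"
        return ToolCallRecord(tool_name, arguments, False, error=message)
    return ToolCallRecord(tool_name, arguments, True, result=result)


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a real API client (assumption, not a real service)."""
    if not city:
        raise ValueError("city must be non-empty")
    return {"city": city, "temp_c": 21}


records = [
    execute_and_record("get_weather", get_weather, {"city": "Oslo"}),
    execute_and_record("get_weather", get_weather, {"city": ""}),      # rejected by the tool
    execute_and_record("get_weather", get_weather, {"town": "Oslo"}),  # wrong argument name
]

for record in records:
    print(json.dumps(asdict(record)))
```

In a real pipeline the stand-in function would be replaced by an HTTP client for the target API, and the records would be written out as training trajectories. The failed calls are as useful as the successful ones, since they show the model which argument shapes the API rejects.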

EnvFactory: Scaling via Reinforcement Learning

Minrui Xu et al. tackle the scaling problem from a different angle. EnvFactory synthesizes executable environments and uses Agentic Reinforcement Learning (Agentic RL) to train agents. The key insight: agents learn more robustly when they can explore and fail in safe environments, rather than being trained on static demonstration data.
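The contrast between exploring in an executable environment and imitating static demonstrations can be illustrated with a toy example. The sketch below is not EnvFactory's method. It uses a hypothetical three-tool environment (`ToyToolEnv`) and plain epsilon-greedy value learning as a simple stand-in for a real agentic RL algorithm. The agent starts with no knowledge, tries tools, fails without consequence, and learns from the reward the environment returns.

```python
import random
from typing import Dict


class ToyToolEnv:
    """Hypothetical executable environment: each task is solved by exactly one tool."""

    TOOLS = ["search", "calculator", "calendar"]
    SOLUTIONS = {
        "lookup_fact": "search",
        "add_numbers": "calculator",
        "book_meeting": "calendar",
    }

    def __init__(self, rng: random.Random) -> None:
        self.rng = rng
        self.task = ""

    def reset(self) -> str:
        """Sample a new task and return it as the observation."""
        self.task = self.rng.choice(sorted(self.SOLUTIONS))
        return self.task

    def step(self, tool: str) -> float:
        """Execute the chosen tool; reward 1.0 only if it solves the task."""
        return 1.0 if self.SOLUTIONS[self.task] == tool else 0.0


def train(
    episodes: int = 2000, epsilon: float = 0.1, lr: float = 0.1, seed: int = 0
) -> Dict[str, Dict[str, float]]:
    """Learn a value estimate for each (task, tool) pair by trial and error."""
    rng = random.Random(seed)
    env = ToyToolEnv(rng)
    q = {task: {tool: 0.0 for tool in env.TOOLS} for task in env.SOLUTIONS}
    for _ in range(episodes):
        task = env.reset()
        if rng.random() < epsilon:
            tool = rng.choice(env.TOOLS)  # explore: failing here is cheap
        else:
            tool = max(env.TOOLS, key=lambda t: q[task][t])  # exploit
        reward = env.step(tool)
        q[task][tool] += lr * (reward - q[task][tool])  # move estimate toward reward
    return q


def greedy_policy(q: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the highest-valued tool for every task."""
    return {task: max(values, key=values.get) for task, values in q.items()}


policy = greedy_policy(train())
print(policy)
```

A real system would replace the lookup table with an LLM policy and the one-step reward with multi-step tool trajectories, but the loop has the same shape: act, execute, observe the outcome, update.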

Why This Matters for the Ecosystem

The 204 tools indexed in To Play Claw only deliver value if agents can call them reliably. Papers like FireFly and EnvFactory represent the infrastructure layer that will make tool-calling agents production-ready. Combined with MCP as the standard protocol, we’re seeing the full stack emerge: protocol standardization (MCP), security enforcement (MCP Proxy/ADR), and training infrastructure (FireFly/EnvFactory).

Sources: FireFly (arXiv:2605.17558), EnvFactory (arXiv:2605.18703), OpenAPI→MCP (arXiv:2605.14312)
