arXiv · analysis signal

The Tool-Calling Training Gap: FireFly and EnvFactory Attack the Bottleneck

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

Signal thesis

The next leap in agent reliability won't come from better models — it will come from verified tool-call training data at scale.

Why it matters

Every tool indexed in agentk.it — whether MCP server, CLI tool, or workflow — depends on the agent's ability to call it correctly. The quality of tool-calling directly determines whether these tools are useful or dangerous.

Original source

https://arxiv.org/abs/2605.17558

Key takeaways

Read this first.

  1. FireFly uses real APIs (not synthetic mocks) to generate training data with verifiable labels
  2. EnvFactory introduces Agentic RL: agents learn tool use through reinforcement in executable environments
  3. Combined, these approaches could produce agents that reliably call tools in production
  4. OpenAPI-to-MCP conversion (600 endpoints) now has validation tooling via multi-agent LLM systems

Ecosystem impact

Where this changes the map.

MCP Server Developers

Your API documentation quality directly affects tool-call training data quality

Agent Framework Teams

FireFly's approach could be integrated into agent fine-tuning pipelines

Full English translation

Translated text.

The Bottleneck

Training an LLM to reliably call tools requires massive amounts of trajectory data — sequences of agent actions, tool calls, and outcomes. But generating this data at scale faces two problems: synthetic environments don’t match real API behavior, and tasks generated without ground-truth outcomes can’t be verified.

FireFly: Verified Data from Real APIs

Yuxuan Lu et al. present FireFly, a pipeline that generates verified tool-call data directly from real APIs. Unlike approaches that simulate tool environments, FireFly connects to actual APIs, executes tool calls, and records the outcomes. Each trajectory has a verifiable label — the tool call either succeeded or failed against the real API.
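
As a rough illustration of execution-verified labeling, the Python sketch below runs each tool call against a callable and derives the label from what actually happened. It is not FireFly's pipeline: the `ToolCallRecord` and `Trajectory` schemas, the `execute_and_record` helper, and the `get_weather` stand-in (a local function used in place of a real HTTP API) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolCallRecord:
    """One executed tool call and its observed outcome (hypothetical schema)."""
    tool: str
    arguments: Dict[str, Any]
    succeeded: bool
    result: Any = None
    error: str = ""


@dataclass
class Trajectory:
    """A task plus the tool calls executed while attempting it."""
    task: str
    calls: List[ToolCallRecord] = field(default_factory=list)

    @property
    def verified(self) -> bool:
        # The label comes from execution, not from a model's guess:
        # a trajectory is verified only if it has calls and all succeeded.
        return bool(self.calls) and all(c.succeeded for c in self.calls)


def execute_and_record(
    trajectory: Trajectory,
    tool: str,
    arguments: Dict[str, Any],
    api: Callable[..., Any],
) -> ToolCallRecord:
    """Run one call against `api` and append the labeled outcome."""
    try:
        record = ToolCallRecord(tool, arguments, True, result=api(**arguments))
    except Exception as exc:  # any API failure becomes a negative label
        record = ToolCallRecord(tool, arguments, False, error=str(exc))
    trajectory.calls.append(record)
    return record


def get_weather(city: str) -> Dict[str, Any]:
    """Stand-in for a real API client; a real pipeline would hit the network."""
    if not city:
        raise ValueError("city must be non-empty")
    return {"city": city, "temp_c": 21}


good = Trajectory(task="Look up the weather in Rome")
execute_and_record(good, "get_weather", {"city": "Rome"}, get_weather)

bad = Trajectory(task="Look up the weather with a missing city")
execute_and_record(bad, "get_weather", {"city": ""}, get_weather)

print(good.verified, bad.verified)  # prints: True False
```

Because the label is a by-product of execution, failed calls are kept as negative examples rather than discarded, and no separate annotation step is needed in this toy setup.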

EnvFactory: Scaling via Reinforcement Learning

Minrui Xu et al. tackle the scaling problem from a different angle. EnvFactory synthesizes executable environments and uses Agentic Reinforcement Learning (Agentic RL) to train agents. The key insight: agents learn more robustly when they can explore and fail in safe environments, rather than being trained on static demonstration data.
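
To make "explore and fail in a safe environment" concrete, here is a deliberately tiny Python sketch: a sandboxed environment that rewards only the correct tool call, and a bandit-style learner that discovers it by trial and error. This is far simpler than anything EnvFactory is described as doing. The `ToyToolEnv` class, the three candidate actions, the reward rule, and the optimistic-initialization learner are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument value); hypothetical action format


class ToyToolEnv:
    """Executable sandbox: one task, reward 1.0 only for the correct tool call."""

    def __init__(self) -> None:
        self.task = "Convert 10 USD to EUR"
        self.actions: List[Action] = [
            ("get_weather", "USD"),        # wrong tool
            ("convert_currency", "XXX"),   # right tool, invalid currency code
            ("convert_currency", "USD"),   # right tool, valid argument
        ]

    def reset(self) -> str:
        return self.task

    def step(self, action: Action) -> float:
        # Failing here is safe: nothing outside the sandbox is touched.
        return 1.0 if action == ("convert_currency", "USD") else 0.0


def train(env: ToyToolEnv, episodes: int = 20) -> Dict[Action, float]:
    """Greedy value learning with optimistic initial estimates."""
    values = {a: 1.0 for a in env.actions}  # optimism forces each action to be tried
    counts = {a: 0 for a in env.actions}
    for _ in range(episodes):
        env.reset()
        action = max(env.actions, key=lambda a: values[a])  # ties: first listed wins
        reward = env.step(action)
        counts[action] += 1
        # Incremental mean of the rewards observed for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values


learned = train(ToyToolEnv())
best = max(learned, key=learned.get)
print(best)  # prints: ('convert_currency', 'USD')
```

The learner tries both wrong calls once, sees zero reward, and then settles on the correct one. A static demonstration dataset would only ever show the correct call, so the agent would never observe why the alternatives fail.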

Why This Matters for the Ecosystem

The 204 tools indexed in agentk.it only deliver value if agents can call them reliably. FireFly and EnvFactory belong to the training infrastructure that could make tool-calling agents production-ready. Together with MCP as the standard protocol, a full stack is emerging: protocol standardization (MCP), security enforcement (MCP Proxy/ADR), and training infrastructure (FireFly/EnvFactory).

Sources: FireFly (arXiv:2605.17558), EnvFactory (arXiv:2605.18703), OpenAPI→MCP (arXiv:2605.14312)

What to watch next

Follow-up signals.

  • Whether FireFly or EnvFactory release open-source implementations
  • Integration of these training methods into major agent frameworks (LangChain, CrewAI)

Source and permission

Trace the origin.

Original title: Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Source: arXiv
Author: Yuxuan Lu, Ziyi Wang, Yingzhou Lu
Original date: 2026-05-17
Permission: open_license
Published: 2026-05-19
Source URL: https://arxiv.org/abs/2605.17558