State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra

From arXiv; 9 pages, 5 figures, 6 tables, 1 algorithm

Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data. Such data is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns and enforces a backend-is-truth invariant, which eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora. Tool-call hallucination scores reach 9.66 out of 10, and the system supports persona-driven variation via a 23-dimensional trait vector. A per-criterion gap analysis on cleanly separated training and golden evaluation sets confirms that the data is not memorization bait. A comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.
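As a rough illustration of the backend-is-truth idea described in the abstract, the following minimal Python sketch routes every tool call through a single state object, so a tool result can only report what the state actually contains. It is not StateGen's implementation: the names `StateManager`, `call_tool`, `get_order`, and `cancel_order`, the order-management domain, and the error strings are all hypothetical, chosen only to make the invariant concrete.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical handler signature: (state, args) -> (result, new_state).
Handler = Callable[[dict, dict], tuple]


@dataclass
class StateManager:
    """Single authoritative world state shared by all tools (illustrative only)."""
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply(self, tool: str, args: dict, handler: Handler) -> Any:
        # Handlers work on a copy, so the state changes only through this method.
        result, new_state = handler(copy.deepcopy(self.state), args)
        self.state = new_state
        result = copy.deepcopy(result)  # avoid aliasing the stored state
        self.history.append({"tool": tool, "args": args, "result": result})
        return result


def get_order(state: dict, args: dict) -> tuple:
    """Read-only tool: reports only what the state contains."""
    order = state["orders"].get(args["order_id"])
    if order is None:
        return {"error": "order_not_found"}, state
    return {"order": order}, state


def cancel_order(state: dict, args: dict) -> tuple:
    """Mutating tool: succeeds only if the state permits it."""
    order = state["orders"].get(args["order_id"])
    if order is None:
        return {"error": "order_not_found"}, state
    if order["status"] == "shipped":
        return {"error": "already_shipped"}, state
    order["status"] = "cancelled"
    return {"ok": True, "order": order}, state


TOOLS = {"get_order": get_order, "cancel_order": cancel_order}


def call_tool(manager: StateManager, name: str, args: dict) -> Any:
    """Dispatch a tool call; undeclared tools are rejected, never improvised."""
    if name not in TOOLS:
        return {"error": "unknown_tool"}
    return manager.apply(name, args, TOOLS[name])


manager = StateManager(
    state={"orders": {"A1": {"status": "pending"}, "B2": {"status": "shipped"}}}
)
print(call_tool(manager, "cancel_order", {"order_id": "A1"}))
print(call_tool(manager, "get_order", {"order_id": "A1"}))
print(call_tool(manager, "cancel_order", {"order_id": "Z9"}))
```

In this sketch a later `get_order` call necessarily reflects the earlier cancellation, and a call for a non-existent order returns an error instead of a plausible-looking record. Under the same assumptions, a sub-agent could be registered in `TOOLS` like any other handler and would share the same state object.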
