Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a retrieval-augmented generation (RAG) workflow. We introduce AgentSim, an open-source platform for simulating RAG agents that generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure that the agent explores the document set broadly, and it combines a multi-model validation pipeline with an active human-in-the-loop process that focuses human effort on the difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established information retrieval (IR) benchmarks. We make three contributions: (1) the AgentSim platform, with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the ATC, comprising over 103,000 verifiable reasoning steps across the three benchmarks, with a 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. The platform, toolkit, and corpus are publicly available.
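To illustrate how human effort might be focused on steps where validator models disagree, the following is a minimal sketch of disagreement-based routing. It is not AgentSim's actual implementation: the `StepVerdicts` structure, the `agreement` and `route_steps` functions, the verdict labels, and the `threshold` parameter are all hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StepVerdicts:
    """Verdicts from several validator models for one reasoning step.

    Hypothetical structure; the verdict labels are illustrative only.
    """
    step_id: str
    verdicts: list  # e.g. ["valid", "valid", "invalid"]


def agreement(verdicts):
    """Return the fraction of validators that chose the most common verdict."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def route_steps(steps, threshold=1.0):
    """Split steps into automatically resolved and human-review queues.

    A step is sent to human review when validator agreement falls below
    `threshold` (an assumed parameter, not a value from the paper).
    """
    auto_resolved, needs_review = [], []
    for step in steps:
        if agreement(step.verdicts) >= threshold:
            auto_resolved.append(step.step_id)
        else:
            needs_review.append(step.step_id)
    return auto_resolved, needs_review


if __name__ == "__main__":
    steps = [
        StepVerdicts("s1", ["valid", "valid", "valid"]),
        StepVerdicts("s2", ["valid", "invalid", "valid"]),
    ]
    auto_resolved, needs_review = route_steps(steps)
    print("auto-resolved:", auto_resolved)
    print("needs human review:", needs_review)
```

With the default threshold of 1.0, only unanimous steps bypass human review. Lowering the threshold would trade annotation cost against the risk of accepting a contested step.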