Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate on. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final-state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, the expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across model families and inference-time reasoning effort levels, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates, as our ablation studies show. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.
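To make the pipeline concrete, the following is a minimal Python sketch of one evaluation round under stated assumptions: `call_llm` is a hypothetical stand-in for any chat-completion API, the `Scenario` fields and prompt wording are illustrative rather than the paper's actual schema, and exact-match comparison of the proxy state stands in for the paper's LLM-judged goal verification.

```python
"""Illustrative sketch of proxy state-based evaluation.

Assumptions (not from the paper): `call_llm`, the Scenario field names,
and the prompt wording are hypothetical placeholders.
"""
from dataclasses import dataclass, field
import json


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client; wire in your own."""
    raise NotImplementedError("plug in an LLM API client here")


@dataclass
class Scenario:
    """One scenario, mirroring the four ingredients named in the abstract."""
    user_goal: str                    # what the simulated user wants to achieve
    facts: dict                       # user/system facts the simulator may reveal
    expected_final_state: dict        # structured state the agent should reach
    expected_behavior: list = field(default_factory=list)  # behavioral constraints


def infer_proxy_state(trace: list[dict], scenario: Scenario) -> dict:
    """LLM state tracker: infer a structured proxy state from the full trace,
    instead of reading it out of a deterministic backend database."""
    prompt = (
        "Given this interaction trace (user turns, agent turns, tool calls), "
        "output the final state as JSON with exactly these keys: "
        f"{list(scenario.expected_final_state)}.\n\n"
        f"Trace:\n{json.dumps(trace, indent=2)}"
    )
    return json.loads(call_llm(prompt))


def judge(trace: list[dict], scenario: Scenario) -> dict:
    """LLM judges: check goal completion against the proxy state and flag
    tool/user hallucinations that contradict the scenario's facts."""
    proxy_state = infer_proxy_state(trace, scenario)
    # Exact match is a simplification; the paper verifies goal completion
    # with an LLM judge against scenario constraints.
    goal_met = proxy_state == scenario.expected_final_state
    hallucination_prompt = (
        "List any tool outputs or user statements in the trace that contradict "
        f"these scenario facts: {json.dumps(scenario.facts)}. "
        "Answer with a JSON list (empty if none).\n\n"
        f"Trace:\n{json.dumps(trace, indent=2)}"
    )
    hallucinations = json.loads(call_llm(hallucination_prompt))
    return {"goal_met": goal_met, "hallucinations": hallucinations}
```

The design point the sketch illustrates is that the proxy state is keyed to the scenario's expected final state, so final-state comparison survives even though no deterministic database ever materializes the state.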