Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. %This yields benchmarks that are formally validated. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.
翻译:工具使用型大语言模型(LLM)智能体越来越多地被部署在需严格遵循程序化操作手册的场景中。确保此类智能体遵守手册规则极具挑战性,因为手册通常以自然语言为人类编写,而智能体行为则表现为工具调用的执行轨迹。现有LLM智能体评估依赖人工构建基准或基于LLM的评判器,这些方法在处理复杂、长程手册时存在扩展性不足或可靠性缺陷。为突破这些限制,我们提出MANTRA——一种从自然语言手册与工具模式中自动合成机器可验证合规基准的框架。MANTRA独立生成:(i)捕获程序依赖关系的符号化世界模型;(ii)针对给定任务的一组轨迹级合规性检查,并通过可满足性模理论(SMT)求解验证其一致性。结构化修复循环可解决不一致问题,人类干预仅作为后备方案。最终生成的基准通过形式化验证。关键的是,MANTRA支持任意领域与长篇程序手册,并提供可调的任务复杂度概念,用于自动生成随附合规性检查的挑战性任务。利用MANTRA,我们以最小人工构建了涵盖6个领域、285项任务且可扩展至50页以上手册的基准套件。实验表明,与现有基准相比,该合规性检查在约束执行强度上更为丰富;此外,检查的粒度可用于调试智能体的故障模式。这些结果证明,将自动基准生成与形式化验证方法相结合,能够实现工具使用型智能体的可扩展且可靠基准测试。