Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution depends critically on the ability to reason through ambiguous situations in which seeking clarification is necessary for correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating ambiguity and measuring its impact across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal-state divergence. Following this taxonomy, we release 285 task variants drawn from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, alongside a formal analysis of how current agents detect, reason about, and resolve underspecification in ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling the development of reliable autonomous systems.