As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions mid-task, such as added requirements or revised goals, is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types (addition, revision, and retraction) and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones in single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging even for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.