Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, for example by providing prior knowledge or strategic advice. This lightweight design makes it possible to sustain interactions for more than 30 hours and to produce valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it on InnovatorBench. Our experiments show that when Apollo is used to train the GLM-4.5 model on InnovatorBench, it achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.