Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, for example by providing prior knowledge or strategic advice. This lightweight design makes it possible to sustain interactions for more than 30 hours and to produce valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it on InnovatorBench. Our experiments show that when Apollo is used to train the GLM-4.5 model on InnovatorBench, it achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.