Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods fall primarily into two categories. The first relies on dense human annotation for behavior cloning, which is prohibitively expensive for long-horizon tasks that can span days or months. The second depends on outcome-driven sampling, which often collapses because valid positive trajectories are rare in domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, offering prior knowledge, strategic advice, or other guidance. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it on InnovatorBench. Our experiments show that when Apollo is used to train the GLM-4.5 model on InnovatorBench, it achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.