Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
