Mixture of Horizons in Action Chunking

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed the $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, so any fixed choice of a single horizon is suboptimal. To mitigate this trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules, with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on the flow-based policies $\pi_0$ and $\pi_{0.5}$ and the one-step regression policy $\pi_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains in both simulation and real-world tasks. Notably, under the mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state of the art with a 99$\%$ average success rate on LIBERO after only 30k training iterations. Project page: https://timsty1.github.io/moh/
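To make the fusion and consensus ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each horizon $h$ yields a prediction for the first $h$ steps of the chunk, that the gate reduces to one scalar logit per horizon (the abstract only says "light linear gate"), and that consensus is measured as Euclidean distance to the fused action. The names `fuse_horizons`, `consensus_prefix`, and `tol` are hypothetical and chosen for illustration only.

```python
import math


def fuse_horizons(preds, gate_logits):
    """Fuse per-horizon action predictions step by step.

    preds: dict mapping horizon h -> list of h action vectors
           (assumed outputs of a shared action head, one list per horizon).
    gate_logits: dict mapping horizon h -> scalar logit
           (a simplification of the gate; an assumption of this sketch).
    """
    max_h = max(preds)
    fused = []
    for t in range(max_h):
        # Only horizons long enough to cover step t can vote on it.
        valid = [h for h in preds if h > t]
        # Numerically stable softmax over the valid horizons.
        top = max(gate_logits[h] for h in valid)
        exps = {h: math.exp(gate_logits[h] - top) for h in valid}
        z = sum(exps.values())
        dim = len(preds[valid[0]][t])
        fused.append(
            [sum(exps[h] / z * preds[h][t][d] for h in valid) for d in range(dim)]
        )
    return fused


def consensus_prefix(preds, fused, tol):
    """Count leading steps on which all covering horizons agree with the fusion.

    A step counts as stable when every horizon that covers it lies within
    `tol` (Euclidean distance) of the fused action. This criterion is an
    illustrative assumption, not the paper's exact rule.
    """
    n = 0
    for t, action in enumerate(fused):
        valid = [h for h in preds if h > t]
        worst = max(math.dist(preds[h][t], action) for h in valid)
        if worst > tol:
            break
        n += 1
    return max(n, 1)  # always execute at least one action


if __name__ == "__main__":
    # Toy 1-D predictions for horizons 2 and 4 (made-up numbers).
    preds = {2: [[0.0], [0.0]], 4: [[0.0], [0.2], [1.0], [1.0]]}
    gate_logits = {2: 0.0, 4: 0.0}
    fused = fuse_horizons(preds, gate_logits)
    print(fused)
    print(consensus_prefix(preds, fused, tol=0.05))
```

In this sketch a tighter `tol` executes fewer actions before re-planning and a looser one executes more, which is one way to read "dynamic inference with adaptive horizons". The paper's actual selection rule may differ.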

