AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
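The abstract gives no exact weighting formula, so the following is only a minimal sketch of what reweighting a training objective by inverse velocity could look like. The function names, the finite-difference speed estimate, the `eps` smoothing constant, and the mean-1 normalization are all assumptions made for illustration. They are not taken from AttenA+ itself.

```python
import math

def inverse_velocity_weights(actions, eps=1e-3):
    """Per-step loss weights that grow as the motion slows down.

    actions: list of T action vectors (T >= 2), e.g. end-effector positions.
    eps: hypothetical smoothing constant that keeps weights finite at rest.
    """
    # Finite-difference speed between consecutive steps.
    speeds = [math.dist(a, b) for a, b in zip(actions, actions[1:])]
    # Reuse the first difference so that every step has a speed estimate.
    speeds = [speeds[0]] + speeds
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    # Normalize to mean 1 so the overall loss scale stays comparable
    # to uniform weighting.
    return [w / mean for w in raw]

def weighted_mse(pred, target, weights):
    """Mean squared error per step, scaled by the per-step weights."""
    per_step = [
        sum((p - t) ** 2 for p, t in zip(ps, ts)) / len(ps)
        for ps, ts in zip(pred, target)
    ]
    return sum(w * e for w, e in zip(weights, per_step)) / len(weights)

# Toy 1-D trajectory: a fast approach followed by a slow, precise phase.
target = [[0.0], [1.0], [2.0], [2.1], [2.2]]
weights = inverse_velocity_weights(target)

# The same 0.5 error placed on a fast step versus a slow step.
pred_fast_err = [[0.5], [1.0], [2.0], [2.1], [2.2]]
pred_slow_err = [[0.0], [1.0], [2.0], [2.1], [2.7]]
loss_fast = weighted_mse(pred_fast_err, target, weights)
loss_slow = weighted_mse(pred_slow_err, target, weights)
```

In this sketch an identical prediction error costs more on the slow final step than on the fast first step, which is the qualitative behavior the abstract describes. Because the weights are computed from the demonstration alone, the scheme adds no parameters and leaves the model architecture untouched.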

