Existing robotic foundation models, while powerful, rest on an implicit assumption of temporal homogeneity: every action is treated as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, is indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous: low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. This misalignment between uniform loss weighting and physical criticality limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAMs) in complex, long-horizon tasks. To address this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective according to the inverse velocity field, AttenA+ aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, it can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments show that AttenA+ improves current state-of-the-art models: it raises OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further demonstrates its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, opening a new path toward general-purpose robotic control.
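To make the idea of inverse-velocity reweighting concrete, the following is a minimal NumPy sketch, not the AttenA+ implementation. The abstract does not give the exact weighting formula, so everything here is an assumption for illustration: the function names `inverse_velocity_weights` and `weighted_mse`, the finite-difference speed estimate, the stabilizer `eps`, and the normalization of weights to mean 1. It also assumes the action sequence holds absolute positions; if actions are already deltas, the per-step action norm could serve as the speed instead.

```python
import numpy as np

def inverse_velocity_weights(actions, dt=1.0, eps=1e-3):
    """Per-timestep loss weights that grow as the motion slows down.

    actions: array of shape (T, D) holding absolute positions (assumed).
    dt:      time between consecutive steps (assumed constant).
    eps:     hypothetical stabilizer that caps the weight at zero speed.
    """
    actions = np.asarray(actions, dtype=float)
    # Finite-difference velocity along the time axis, shape (T, D).
    velocity = np.gradient(actions, dt, axis=0)
    # Scalar speed per timestep, shape (T,).
    speed = np.linalg.norm(velocity, axis=-1)
    # Inverse-speed weighting: slow segments receive larger weights.
    raw = 1.0 / (speed + eps)
    # Normalize to mean 1 so the overall loss scale stays comparable
    # to the unweighted objective (an illustrative choice).
    return raw / raw.mean()

def weighted_mse(pred, target, weights):
    """Mean squared error where each timestep is scaled by its weight."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_step = np.mean((pred - target) ** 2, axis=-1)  # shape (T,)
    return float(np.mean(weights * per_step))

# Toy 1-D trajectory: a fast approach followed by a slow, precise phase.
trajectory = np.array([[0.0], [1.0], [2.0], [3.0], [3.1], [3.2], [3.3]])
weights = inverse_velocity_weights(trajectory)
```

Because the weights depend only on the demonstration data, a scheme like this adds no parameters and leaves the backbone unchanged, which is consistent with the plug-and-play claim above. With mean-1 normalization, an error of the same size costs more in the slow phase than in the fast phase.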