Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.
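As a minimal sketch of what a camera-frame $SE(3)$ end-effector trajectory looks like as data, the NumPy snippet below represents poses as 4x4 homogeneous matrices, maps a trajectory into a robot base frame, and computes the relative rigid-body step between consecutive poses. It is illustrative only and is not the OASIS implementation: the helper names (`make_pose`, `camera_to_base`, `relative_steps`), the extrinsic `T_base_cam`, and the toy trajectory are all assumptions introduced here.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous SE(3) matrix from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def invert_pose(T):
    """Closed-form SE(3) inverse: (R, t) -> (R^T, -R^T t)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def camera_to_base(traj_cam, T_base_cam):
    """Re-express a camera-frame trajectory (N, 4, 4) in the base frame."""
    # matmul broadcasts the single (4, 4) extrinsic over the N poses.
    return T_base_cam @ traj_cam


def relative_steps(traj):
    """Relative motion between consecutive poses: inv(T_k) @ T_{k+1}."""
    return np.stack([invert_pose(a) @ b for a, b in zip(traj[:-1], traj[1:])])


# Hypothetical 4-step trajectory in the camera frame: the end-effector
# translates along x while rotating about z, at a fixed depth of 0.5 m.
traj_cam = np.stack(
    [make_pose(rot_z(0.1 * k), [0.05 * k, 0.0, 0.5]) for k in range(4)]
)

# Hypothetical camera extrinsic (pose of the camera in the base frame).
T_base_cam = make_pose(rot_z(np.pi / 2), [0.3, 0.0, 0.8])

traj_base = camera_to_base(traj_cam, T_base_cam)
steps_cam = relative_steps(traj_cam)
steps_base = relative_steps(traj_base)

# The relative steps do not depend on which fixed frame the poses are
# expressed in, because the extrinsic cancels in inv(T_k) @ T_{k+1}.
print(np.allclose(steps_cam, steps_base))
```

The final check illustrates one property of rigid-body geometry that a pixel-space or feature-space representation does not carry by construction: changing the reference frame by a fixed transform leaves the relative end-effector motion unchanged.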