Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, which predicts actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, which leverages intermediate 3D representations. However, these methods often produce inaccurate actions because manipulation scenes are complex and dynamic. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action from Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by GAF, to refine the initial action into a more precise one. Extensive experiments demonstrate significant improvements: GAF improves reconstruction quality by +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS, and raises the average success rate on robotic manipulation tasks by 7.3% over state-of-the-art methods.
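To give a sense of scale for the reported PSNR gain, the short sketch below converts a dB improvement into the corresponding reduction in mean squared error (MSE), using the standard definition PSNR = 10 log10(MAX^2 / MSE). It is only an illustration: the function names and the baseline MSE value are hypothetical, pixel values are assumed to be normalized to [0, 1], and the conversion is exact only for a single image pair (a PSNR averaged over many images need not map exactly onto an averaged MSE).

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_reduction_factor(psnr_gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 10.0 ** (psnr_gain_db / 10.0)


# PSNR gain reported in the abstract.
factor = mse_reduction_factor(11.5385)

# Hypothetical baseline error, chosen only to make the arithmetic concrete.
baseline_mse = 0.01
improved_mse = baseline_mse / factor

# Recover the gain from the two MSE values as a consistency check.
gain = psnr(improved_mse) - psnr(baseline_mse)
print(f"MSE reduction factor: {factor:.2f}")
print(f"Recovered PSNR gain: {gain:.4f} dB")
```

Under these assumptions, a gain of +11.5385 dB corresponds to a per-image MSE that is roughly 14 times smaller than the baseline.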