Vision-language-action (VLA) models show potential for general robotic tasks, but spatiotemporally coherent manipulation, which requires fine-grained representations, remains challenging for them. Existing methods typically embed 3D positions into visual representations to enhance the spatial precision of actions; however, they struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions to form 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations in the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments verify the superiority of our method across different robotic manipulation tasks.
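As a rough illustration of the first design, the sketch below shows one way visual tokens could be fused with 4D (x, y, z, t) embeddings via cross-attention, as described in the abstract. It is a minimal sketch under assumptions: the module name (`Fuser4D`), dimensions, and MLP structure are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of 4D-aware visual fusion: lift (x, y, z, t) coordinates into
# the feature space and let visual tokens attend to them via cross-attention.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn


class Fuser4D(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Project 4D coordinates (x, y, z, t) into the visual feature space.
        self.coord_mlp = nn.Sequential(
            nn.Linear(4, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Visual tokens act as queries over the 4D position-time embeddings.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_feats, positions_3d, timestep):
        # visual_feats:  [B, N, d_model] visual tokens
        # positions_3d:  [B, N, 3] per-token 3D positions
        # timestep:      [B] time index of the current observation
        t = timestep.float().view(-1, 1, 1).expand(-1, positions_3d.size(1), 1)
        coords_4d = torch.cat([positions_3d, t], dim=-1)              # [B, N, 4]
        pos_emb = self.coord_mlp(coords_4d)                           # [B, N, d_model]
        fused, _ = self.cross_attn(query=visual_feats, key=pos_emb, value=pos_emb)
        return self.norm(visual_feats + fused)                        # unified representation


# Toy usage with random tensors.
if __name__ == "__main__":
    B, N, D = 2, 64, 256
    fuser = Fuser4D(d_model=D)
    out = fuser(torch.randn(B, N, D), torch.randn(B, N, 3), torch.tensor([0, 5]))
    print(out.shape)  # torch.Size([2, 64, 256])
```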