Current Vision-Language-Action (VLA) models primarily map 2D observations to actions and exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, which introduces substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction and lacks alignment with the instruction-grounded scene, which compromises spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which ensures cross-object spatial geometric consistency by using compact latent representations to eliminate ambiguities in the spatial relations between objects across views. Building upon these, we introduce 3) CS-Thinker, which achieves cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from the object-semantic tokens of CV-Aligner and of global depth from the geometric tokens of CO-Fuser, thereby enabling efficient visual reasoning under scene variations. Extensive experiments demonstrate that, owing to its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements and 2.3-fold and 2.4-fold inference speedups over OpenVLA on the LIBERO benchmark and on real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available at
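As an illustration only, the idea of filtering instruction-relevant regions across views, which the abstract attributes to CV-Aligner, can be sketched as a per-view top-k selection of visual tokens by cosine similarity to an instruction embedding. This is not the authors' implementation: the function name `filter_instruction_relevant_tokens`, the tensor shapes, the cosine-similarity scoring, and the `keep_k` parameter are all assumptions made for this sketch.

```python
import numpy as np


def filter_instruction_relevant_tokens(patch_tokens, instruction_embedding, keep_k):
    """Hypothetical helper: keep the keep_k patches per view closest to the instruction.

    patch_tokens:          array of shape (num_views, num_patches, dim)
    instruction_embedding: array of shape (dim,)
    Returns (kept_tokens, kept_indices) with shapes
    (num_views, keep_k, dim) and (num_views, keep_k).
    """
    eps = 1e-8  # guards against division by zero for all-zero vectors
    patches = patch_tokens / (np.linalg.norm(patch_tokens, axis=-1, keepdims=True) + eps)
    instruction = instruction_embedding / (np.linalg.norm(instruction_embedding) + eps)

    # Cosine similarity of every patch in every view to the instruction.
    scores = patches @ instruction  # shape (num_views, num_patches)

    # Indices of the keep_k highest-scoring patches in each view, best first.
    kept_indices = np.argsort(-scores, axis=1)[:, :keep_k]

    # Gather the original (unnormalized) tokens at those indices.
    kept_tokens = np.take_along_axis(patch_tokens, kept_indices[:, :, None], axis=1)
    return kept_tokens, kept_indices
```

Under these assumptions, each camera view is reduced from `num_patches` tokens to `keep_k` tokens before any cross-view alignment. Fewer tokens would be one plausible source of the efficiency the abstract describes, though the abstract itself does not specify the mechanism.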