ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Current Vision-Language-Action (VLA) models primarily map 2D observations to actions and show notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction and lacks alignment with the instruction-grounded scene, which compromises spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by using compact latent representations to eliminate ambiguities in the spatial relations between objects across views. Building upon these, we introduce 3) CS-Thinker, which achieves cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from the object-semantic tokens of CV-Aligner and of global depth from the geometric tokens of CO-Fuser, thereby enabling efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves performance improvements of 21.6% and 41.5%, along with inference speedups of 2.3-fold and 2.4-fold, compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available.
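The abstract describes the modules only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes, hypothetically, that "filtering instruction-relevant regions" can be approximated by keeping the top-k patch tokens per view ranked by cosine similarity to an instruction embedding, and that a "compact latent representation" can be approximated by single-head cross-attention from a few latent queries. The function names `select_relevant_tokens` and `fuse_to_latents`, all shapes, and the random inputs are assumptions made for illustration.

```python
import numpy as np


def select_relevant_tokens(view_tokens, instruction, k):
    """Keep the k patch tokens per view most similar to the instruction.

    Hypothetical stand-in for instruction-relevant region filtering.
    view_tokens: (V, N, D) array, V camera views with N patch tokens each.
    instruction: (D,) instruction embedding.
    Returns the kept tokens (V, k, D) and their indices (V, k).
    """
    tokens_n = view_tokens / np.linalg.norm(view_tokens, axis=-1, keepdims=True)
    instr_n = instruction / np.linalg.norm(instruction)
    scores = tokens_n @ instr_n                      # cosine similarity, (V, N)
    idx = np.argsort(-scores, axis=1)[:, :k]         # top-k indices per view
    kept = np.take_along_axis(view_tokens, idx[:, :, None], axis=1)
    return kept, idx


def fuse_to_latents(tokens, queries):
    """Compress (M, D) tokens into (L, D) latents with single-head attention.

    Hypothetical stand-in for a compact latent representation.
    """
    d = tokens.shape[-1]
    logits = queries @ tokens.T / np.sqrt(d)         # (L, M)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over tokens
    return weights @ tokens                          # (L, D)


# Toy inputs: all sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
V, N, D, K, L = 2, 16, 8, 4, 3
view_tokens = rng.normal(size=(V, N, D))
instruction = rng.normal(size=D)
latent_queries = rng.normal(size=(L, D))

kept, idx = select_relevant_tokens(view_tokens, instruction, K)
latents = fuse_to_latents(kept.reshape(V * K, D), latent_queries)
print(kept.shape, latents.shape)
```

In this toy setting, 32 patch tokens across two views are reduced to 8 instruction-relevant tokens and then to 3 latent tokens. The sketch illustrates token filtering and compression in general and makes no claim about how ConsisVLA-4D obtains its reported speedups.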
