GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface that converts language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, even though manipulation benefits from a stable object shape, which calibrated multi-view observations can provide when they are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first bottleneck, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues against input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. We evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. GeoFuse-MV3D improves over the MV-SAM3D baseline, reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%. KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
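To make the idea of governed, precision-oriented retrieval concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the KnowledgeBank implementation: the `MemoryItem` fields, the `retrieve` function, the thresholds, and the multiplicative scoring rule are all hypothetical and are only loosely modeled on the metadata the abstract lists (quality, confidence, lifecycle, verifier, conflict). The sketch filters out unverified, inactive, or low-confidence items, discards weak semantic matches instead of always returning the top k, and skips any item that conflicts with one already selected.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Set


@dataclass
class MemoryItem:
    """One stored experience plus hypothetical governance metadata."""
    item_id: str
    text: str
    embedding: List[float]
    quality: float                 # assumed range [0, 1]
    confidence: float              # assumed range [0, 1]
    lifecycle: str = "active"      # assumed states: "active", "deprecated"
    verified: bool = False         # assumed to be set by an external verifier
    conflicts_with: Set[str] = field(default_factory=set)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: List[float], bank: List[MemoryItem], k: int = 2,
             min_sim: float = 0.5, min_conf: float = 0.6) -> List[MemoryItem]:
    """Return at most k items, preferring precision over recall."""
    scored = []
    for item in bank:
        # Governance filters: lifecycle state, verifier flag, confidence.
        if item.lifecycle != "active" or not item.verified:
            continue
        if item.confidence < min_conf:
            continue
        sim = cosine(query, item.embedding)
        # Precision filter: drop weak matches rather than padding to k.
        if sim < min_sim:
            continue
        # Assumed scoring rule: similarity weighted by quality and confidence.
        scored.append((sim * item.quality * item.confidence, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)

    chosen: List[MemoryItem] = []
    for _, item in scored:
        # Conflict handling: keep only the higher-scored side of a conflict.
        clashes = any(
            item.item_id in kept.conflicts_with
            or kept.item_id in item.conflicts_with
            for kept in chosen
        )
        if clashes:
            continue
        chosen.append(item)
        if len(chosen) == k:
            break
    return chosen
```

Because weak matches are dropped before ranking, `retrieve` may return fewer than `k` items or none at all. That is the intended trade-off in this sketch: an empty result is preferred over injecting an unverified or contradictory memory into the planner's context.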
