Robotic grasping is a fundamental capability in robotic manipulation, yet it remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than as a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. On top of this shared object latent, we introduce an anchor-initialized, truncated pose-reasoning diffuser that predicts continuous, multimodal grasp poses without relying directly on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater: reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in both mesh and 3D Gaussian splatting (3DGS) form. Experiments show that GraspFoM achieves state-of-the-art results on both reconstruction and grasping while adding only a small number of trainable parameters, and component-wise ablation studies confirm the contribution of each component.