Simultaneous 3D reconstruction and 6D object pose estimation from a single monocular image is an inherently ill-posed problem. In industrial settings, however, multiple instances of an object are often randomly arranged in bins, implicitly providing several views of the same object within a single image. We show that this implicit multi-view geometry can be exploited to simultaneously reconstruct the object in 3D and estimate the 6D pose of each visible object instance. We present MooMIns, a new Gaussian-splatting-based approach that inverts the original Gaussian splatting formulation: instead of rendering a single scene from multiple cameras, we render multiple object instances from a single camera. Our method is initialized with SAM3 instance segmentation masks and a modified Structure-from-Motion (SfM) pipeline. In contrast to learned monocular depth estimation, we perform true geometry-based reconstruction from image evidence, avoiding hallucinations caused by training data priors. We evaluate MooMIns on synthetic and real bin-picking scenarios and demonstrate accurate reconstruction of previously unseen objects as well as reliable pose estimation of individual instances.
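The implicit multi-view geometry in the abstract rests on a standard rigid-transform identity: an object instance with pose (R_i, t_i) relative to the one real camera produces the same image as the canonical object seen by a virtual camera with extrinsics (R_i, t_i). The NumPy sketch below only illustrates that identity and is not the MooMIns implementation. The intrinsics `K`, the object points, the two instance poses, and all function names are hypothetical values chosen for illustration.

```python
import numpy as np


def rotation_z(angle_rad):
    """Rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def project(K, points_cam):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def virtual_camera_center(R, t):
    """Center c of the virtual camera in the object frame, from R c + t = 0."""
    return -R.T @ t


def instance_views(K, object_points, instance_poses):
    """For each instance pose, project the object in two equivalent ways."""
    views = []
    for R, t in instance_poses:
        # View A: the single real camera observes the instance placed at (R, t).
        pixels_real = project(K, object_points @ R.T + t)
        # View B: a virtual camera centered at c, with the same rotation R,
        # observes the canonical (unmoved) object.
        c = virtual_camera_center(R, t)
        pixels_virtual = project(K, (object_points - c) @ R.T)
        views.append((pixels_real, pixels_virtual, c))
    return views


# Hypothetical intrinsics and a few points on a canonical object (meters).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
object_points = np.array(
    [[0.05, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.05], [0.02, 0.03, 0.01]]
)

# Hypothetical 6D poses (object frame -> camera frame) of two instances in a bin.
instance_poses = [
    (rotation_z(0.3), np.array([0.10, 0.00, 0.60])),
    (rotation_z(-1.2), np.array([-0.08, 0.05, 0.75])),
]

if __name__ == "__main__":
    views = instance_views(K, object_points, instance_poses)
    for i, (pixels_real, pixels_virtual, center) in enumerate(views):
        same = np.allclose(pixels_real, pixels_virtual)
        print(f"instance {i}: projections match = {same}, virtual camera center = {center}")
```

Because each instance yields a distinct virtual camera center, N visible instances behave like N views of one object. Under these assumptions, that is what lets a multi-view method such as SfM or Gaussian splatting operate on a single image.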