Segmenting unseen objects in images is a critical perception skill for robots. In robot manipulation, it enables a robot to grasp and manipulate objects it has not encountered before. Mean shift clustering is widely used for image segmentation. However, the traditional mean shift clustering algorithm is not differentiable, which makes it difficult to integrate into an end-to-end neural network training framework. In this work, we propose the Mean Shift Mask Transformer (MSMFormer), a new transformer architecture that simulates the von Mises-Fisher (vMF) mean shift clustering algorithm, so that the feature extractor and the clustering can be trained and run at inference jointly. Its central component is a hypersphere attention mechanism, which updates object queries on a hypersphere. To illustrate the effectiveness of our method, we apply MSMFormer to unseen object instance segmentation. Our experiments show that MSMFormer achieves performance competitive with state-of-the-art methods on this task. The video and code are available at https://irvlutd.github.io/MSMFormer
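For readers unfamiliar with vMF mean shift, the following is a minimal NumPy sketch of the classical clustering procedure that the abstract refers to, not of MSMFormer itself. It assumes unit-normalized feature vectors, a fixed concentration parameter `kappa`, and a fixed number of iterations; the function names, the parameter values, and the toy data are all hypothetical choices made for illustration. Each iteration weights every point by a vMF kernel centered at the current estimate, averages, and projects the result back onto the unit hypersphere.

```python
import numpy as np


def vmf_mean_shift(points, seeds, kappa=20.0, n_iters=10):
    """Classical vMF mean shift on the unit hypersphere (illustrative sketch).

    points: (n_points, dim) feature vectors
    seeds:  (n_seeds, dim) initial cluster centers
    kappa:  assumed vMF concentration; larger values give narrower kernels
    """
    # Project points and seeds onto the unit hypersphere.
    x = points / np.linalg.norm(points, axis=1, keepdims=True)
    mu = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
    for _ in range(n_iters):
        # vMF kernel weights exp(kappa * cosine similarity), shape (n_seeds, n_points).
        logits = kappa * (mu @ x.T)
        # Subtracting the row maximum avoids overflow; the scale cancels below.
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        # Weighted sum of points, then re-normalize onto the hypersphere.
        mu = w @ x
        mu = mu / np.linalg.norm(mu, axis=1, keepdims=True)
    return mu


def assign_clusters(points, centers):
    """Label each point with the index of the center of highest cosine similarity."""
    x = points / np.linalg.norm(points, axis=1, keepdims=True)
    return np.argmax(x @ centers.T, axis=1)


# Toy data: two well-separated groups of directions in 3-D.
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((50, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((50, 3))
points = np.vstack([group_a, group_b])

# One seed taken from each group (an assumption made for this demo).
seeds = points[[0, 50]]
centers = vmf_mean_shift(points, seeds)
labels = assign_clusters(points, centers)
```

The update is an exponentially weighted average over points followed by normalization, which is structurally similar to a softmax attention step. That similarity is one way to read the abstract's description of a hypersphere attention mechanism that updates object queries, though the actual MSMFormer layers are learned and are not reproduced here.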