Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation

Novel Instance Detection and Segmentation (NIDS) aims to detect and segment novel object instances given only a few examples of each instance. We propose a unified, simple yet effective framework (NIDS-Net) comprising three stages: object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment. Leveraging recent advances in large vision models, we use Grounding DINO and the Segment Anything Model (SAM) to obtain object proposals with accurate bounding boxes and masks. Central to our approach is the generation of high-quality instance embeddings. We average the foreground patch embeddings from the DINOv2 ViT backbone and then refine the result with a weight adapter mechanism that we introduce. We show experimentally that the weight adapter adjusts the embeddings locally within their feature space and effectively limits overfitting in the few-shot setting. This enables a straightforward matching strategy and yields significant performance gains. Our framework surpasses current state-of-the-art methods, with improvements of 22.3, 46.2, 10.3, and 24.0 in average precision (AP) on four detection datasets. For instance segmentation on the seven core datasets of the BOP challenge, our method is around 4.5 times faster than the leading published RGB method and surpasses it by 3.6 AP. NIDS-Net is also about 5.7 times faster than the top RGB-D method while maintaining competitive performance. Project Page: https://irvlutd.github.io/NIDSNet/
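The embedding and matching stages described above can be illustrated with a minimal NumPy sketch. It is not the NIDS-Net implementation: the function names, the cosine-similarity criterion, and the `threshold` parameter are assumptions made for illustration, the patch embeddings are taken as given rather than computed by DINOv2, and the weight adapter is left out because the abstract does not specify its form.

```python
import numpy as np


def foreground_average(patch_embeddings, mask):
    """Average patch embeddings over the foreground patches only.

    patch_embeddings: (H, W, D) array of per-patch features (assumed layout).
    mask: (H, W) boolean array, True where a patch belongs to the object.
    """
    fg = patch_embeddings[mask]  # (N_fg, D): rows for foreground patches
    if fg.shape[0] == 0:
        raise ValueError("mask has no foreground patches")
    return fg.mean(axis=0)


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def match_proposals(proposal_emb, template_emb, template_labels, threshold=0.5):
    """Assign each proposal the label of its most similar template.

    Similarity is cosine similarity (an assumption of this sketch).
    Proposals whose best score is below `threshold` get label -1.
    """
    sim = l2_normalize(proposal_emb) @ l2_normalize(template_emb).T  # (P, T)
    best = sim.argmax(axis=1)
    scores = sim[np.arange(sim.shape[0]), best]
    labels = np.where(scores >= threshold, template_labels[best], -1)
    return labels, scores
```

In this sketch each proposal region and each template image would be reduced to a single vector by `foreground_average`, after which `match_proposals` compares the two sets and rejects proposals that resemble no template.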
