Contemporary cutting-edge open-vocabulary segmentation approaches commonly rely on image-mask-text triplets, yet this strict form of annotation is labour-intensive and faces scalability hurdles in complex real-world scenarios. Although some methods have been proposed to reduce the annotation cost by using only text supervision, the incomplete supervision severely limits their versatility and performance. In this paper, we relax the strict correspondence between masks and texts by using independent image-mask and image-text pairs, each of which can be collected easily on its own. With this unpaired mask-text supervision, we propose a new weakly-supervised open-vocabulary segmentation framework (Uni-OVSeg) that leverages confident pairs of mask predictions and entities in text descriptions. Using the independent image-mask and image-text pairs, we predict a set of binary masks and associate them with entities by resorting to the CLIP embedding space. However, the inherent noise in the correspondence between masks and entities poses a significant challenge to obtaining reliable pairs. In light of this, we advocate using a large vision-language model (LVLM) to refine text descriptions and devise a multi-scale ensemble to stabilise the matching between masks and entities. Compared to text-only weakly-supervised methods, our Uni-OVSeg achieves a substantial improvement of 15.5% mIoU on the ADE20K dataset, and even surpasses fully-supervised methods on the challenging PASCAL Context-459 dataset.
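To make the idea of associating masks with entities in a shared embedding space more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each predicted mask and each text entity has already been embedded into a common space (for example with CLIP), that mask embeddings are available at several image scales, and that a pair is "confident" when its scale-averaged cosine similarity passes a threshold. The function names, the threshold value, the per-entity argmax matching rule, and the toy vectors are all hypothetical choices made for illustration; the abstract does not specify the actual matching procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_entities_to_masks(mask_embs_per_scale, entity_embs, threshold=0.5):
    """Pair each entity with its most similar mask, keeping confident pairs only.

    mask_embs_per_scale: list over scales; each item is a list of mask
        embeddings (same mask order at every scale). Hypothetical input.
    entity_embs: list of entity (text) embeddings. Hypothetical input.
    threshold: assumed confidence cut-off on the averaged similarity.
    Returns a list of (mask_index, entity_index, score) tuples.
    """
    num_scales = len(mask_embs_per_scale)
    num_masks = len(mask_embs_per_scale[0])
    pairs = []
    for e, ent in enumerate(entity_embs):
        # Multi-scale ensemble: average the similarity over all scales.
        scores = [
            sum(cosine(scale[m], ent) for scale in mask_embs_per_scale) / num_scales
            for m in range(num_masks)
        ]
        best = max(range(num_masks), key=scores.__getitem__)
        # Discard noisy correspondences that fall below the threshold.
        if scores[best] >= threshold:
            pairs.append((best, e, scores[best]))
    return pairs


# Toy 2-D embeddings standing in for real CLIP features.
masks_scale_a = [[1.0, 0.0], [0.0, 1.0]]
masks_scale_b = [[0.9, 0.1], [0.2, 0.8]]
entities = [[0.0, 1.0], [1.0, 1.0]]

pairs = match_entities_to_masks(
    [masks_scale_a, masks_scale_b], entities, threshold=0.9
)
print(pairs)
```

In this toy run, the first entity aligns closely with the second mask at both scales and is kept, while the second entity is ambiguous between the two masks and is dropped by the threshold. Averaging over scales is only one simple way to ensemble; a real system might instead use a one-to-one assignment between masks and entities.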