Learning to segment images purely from the image-text alignment in web data can lead to sub-optimal performance because the data is noisy: in some samples, the associated text does not correlate with the image's visual content. Instead of relying solely on alignment from the noisy data, this paper proposes a novel loss function, termed SimCon, which accounts for intra-modal similarities to determine the appropriate set of positive samples to align. Further, training on multiple synthetically created views of each image and combining them with the SimCon loss makes training more robust; this version of the loss is termed MV-SimCon. Empirical results demonstrate that the proposed loss function yields consistent improvements on zero-shot, text-supervised semantic segmentation and outperforms the state of the art by $+3.0\%$, $+3.3\%$, and $+6.9\%$ on PASCAL VOC, PASCAL Context, and MSCOCO, respectively. With test-time augmentations, we set a new record by improving these results further to $58.7\%$, $26.6\%$, and $33.3\%$ on PASCAL VOC, PASCAL Context, and MSCOCO, respectively. In addition, the proposed loss function leads to robust training and faster convergence.
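To make the idea of choosing positives from intra-modal similarity concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes that two samples count as positives when the cosine similarity of their image embeddings reaches a threshold, and it computes only the image-to-text direction of a multi-positive contrastive loss. The function names, the `threshold` rule, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(a, b):
    """Pairwise cosine similarities between two lists of vectors."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def positive_mask(img_emb, threshold):
    """mask[i][j] is True when image j is similar enough to image i.

    Assumption: positives are picked by thresholding image-image
    (intra-modal) cosine similarity; the diagonal is always positive.
    """
    sim = cosine_matrix(img_emb, img_emb)
    n = len(img_emb)
    return [[i == j or sim[i][j] >= threshold for j in range(n)]
            for i in range(n)]


def similarity_aware_contrastive_loss(img_emb, txt_emb,
                                      threshold=0.9, temperature=0.07):
    """Image-to-text contrastive loss averaged over each image's positives."""
    logits = cosine_matrix(img_emb, txt_emb)
    mask = positive_mask(img_emb, threshold)
    total = 0.0
    for i, row in enumerate(logits):
        scaled = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(scaled)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scaled))
        # Log-probabilities of the texts belonging to positive samples.
        pos = [scaled[j] - log_denom for j in range(len(row)) if mask[i][j]]
        total += -sum(pos) / len(pos)
    return total / len(logits)
```

With a threshold above 1, the mask reduces to the identity and the sketch falls back to a standard one-positive-per-sample contrastive loss, which is a convenient sanity check. A multi-view variant in the spirit of MV-SimCon would apply the same loss across augmented views of each image; that extension is omitted here.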