Referring expression segmentation (RES), the task of localizing a specific object instance from a free-form linguistic description, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual context and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. Although existing SSL techniques are effective in image segmentation, we find that they fall short in RES. To address the challenges specific to RES, including the comprehension of free-form linguistic descriptions and the variability in object attributes, RESMatch introduces three adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong-weak supervision. Extensive validation on multiple RES datasets demonstrates that RESMatch significantly outperforms baseline approaches, establishing a new state of the art. This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.