Referring expression segmentation (RES), the task of localizing specific instance-level objects from free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual context and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. Although existing SSL techniques are effective in image segmentation, we find that they fall short in RES. To address the challenges specific to RES, including the comprehension of free-form linguistic descriptions and the variability in object attributes, RESMatch introduces three adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong-weak supervision. Extensive validation on multiple RES datasets demonstrates that RESMatch significantly outperforms baseline approaches, establishing a new state of the art. This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.