Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, progress in this domain is currently constrained by the limitations of existing datasets, including a narrow range of object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset yields substantial performance improvements, enabling models trained on it to achieve state-of-the-art results, specifically a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is markedly more efficient than the current leading data annotation method, running $4.5\times$ faster than GLaMM.
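For readers unfamiliar with the reported metrics, the sketch below illustrates how cIoU (cumulative IoU over the whole evaluation set) and gIoU (mean per-image IoU) are commonly computed for binary segmentation masks. This is a minimal illustration of the standard definitions only, not the exact evaluation code of GroundingSuite; in particular, the handling of no-target samples here is an assumption.

```python
import numpy as np

def ciou_giou(pred_masks, gt_masks):
    """Compute cumulative IoU (cIoU) and mean per-image IoU (gIoU)
    over paired lists of binary segmentation masks.

    Minimal sketch of the commonly used definitions; the exact
    GroundingSuite protocol (e.g., treatment of empty masks) may differ.
    """
    total_inter, total_union, per_image_ious = 0, 0, []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = pred.astype(bool)
        gt = gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        total_inter += inter
        total_union += union
        # Per-image IoU; assume IoU = 1 when both masks are empty.
        per_image_ious.append(inter / union if union > 0 else 1.0)
    ciou = total_inter / total_union if total_union > 0 else 1.0
    giou = float(np.mean(per_image_ious))
    return ciou, giou
```

Note the difference in emphasis: cIoU aggregates intersection and union over all pixels in the dataset (large objects dominate), whereas gIoU averages per-image IoU, weighting every sample equally.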