Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded in detected objects (132 classes) and actions (51 classes) via a tag system that maintains object identity while linking actions to their corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. We propose gMETEOR, a metric combining caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.