While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.