Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture only fragmented signals and vary substantially in performance across cultures and metadata types, yielding inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
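As an illustration of how the three reported metrics relate to one another, the following is a minimal sketch and not the benchmark's actual scoring code. It assumes the LLM judge emits one boolean verdict per metadata attribute for each object, that an exact match means every attribute was judged correct, and that a partial match means at least one attribute was judged correct. The record layout, the `score_by_region` function, and the region labels are hypothetical; the attribute names are taken from the examples given above.

```python
from collections import defaultdict

# Attribute names follow the examples in the abstract; the real benchmark
# may define more or different fields (assumption).
ATTRIBUTES = ("creator", "origin", "period")


def score_by_region(records):
    """Aggregate per-attribute judge verdicts into per-region metrics.

    Each record is a dict with a 'region' label and a 'verdicts' dict that
    maps every attribute name to True or False (hypothetical layout).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["region"]].append(record["verdicts"])

    report = {}
    for region, verdicts in grouped.items():
        n = len(verdicts)
        report[region] = {
            # Assumed definition: all attributes judged correct.
            "exact_match": sum(all(v[a] for a in ATTRIBUTES) for v in verdicts) / n,
            # Assumed definition: at least one attribute judged correct.
            "partial_match": sum(any(v[a] for a in ATTRIBUTES) for v in verdicts) / n,
            # Fraction of objects for which each attribute was judged correct.
            "attribute_accuracy": {
                a: sum(v[a] for v in verdicts) / n for a in ATTRIBUTES
            },
        }
    return report


# Toy verdicts for two hypothetical regions.
records = [
    {"region": "A", "verdicts": {"creator": True, "origin": True, "period": True}},
    {"region": "A", "verdicts": {"creator": False, "origin": True, "period": False}},
    {"region": "B", "verdicts": {"creator": False, "origin": False, "period": False}},
]
report = score_by_region(records)
```

Under these assumed definitions, a model can reach a high partial-match rate while its exact-match rate stays low, which is one way the "fragmented signals" pattern would appear in the numbers.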