Image captioning, which generates descriptive sentences from images, has improved greatly with Vision-Language Pre-trained models (VLPs) such as BLIP. However, current methods do not generate detailed captions for the cultural elements depicted in images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), which generates captions describing the cultural visual elements found in images that represent cultures. Inspired by methods that combine the visual modality with Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements via Visual Question Answering (VQA) using the generated questions, and (3) generates culturally-aware captions with LLMs using these prompts. A human evaluation with 45 participants from 4 different cultural groups, each with a strong understanding of the corresponding culture, shows that our framework generates more culturally descriptive captions than a VLP-based image captioning baseline. Resources can be found at https://shane3606.github.io/cic.
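The three-step pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `generate_questions`, `answer_question`, and `compose_caption` are hypothetical stand-ins for the paper's question-generation component, its VQA model, and its LLM, respectively.

```python
# Illustrative sketch of the CIC three-step pipeline (hypothetical stand-ins,
# not the authors' code). In practice each function would wrap a model:
# a question generator, a VQA model such as BLIP, and an LLM with prompts.

def generate_questions(image, cultural_categories):
    # Step 1: one question per cultural category (e.g. clothing, food).
    return [f"What {cat} is shown in the image?" for cat in cultural_categories]

def answer_question(image, question):
    # Step 2: VQA extracts a cultural visual element for each question.
    # Stubbed with a fixed answer purely for illustration.
    return "hanbok" if "clothing" in question else "unknown"

def compose_caption(elements):
    # Step 3: an LLM would turn the extracted elements into a fluent,
    # culturally-aware caption; here we simply join the known elements.
    known = [e for e in elements if e != "unknown"]
    return "A photo featuring " + ", ".join(known) + "." if known else "A photo."

def culturally_aware_caption(image, cultural_categories):
    questions = generate_questions(image, cultural_categories)
    elements = [answer_question(image, q) for q in questions]
    return compose_caption(elements)

print(culturally_aware_caption("image.jpg", ["clothing", "food"]))
# -> A photo featuring hanbok.
```

The key design point is that cultural detail enters through targeted VQA queries rather than through a single generic captioning pass, which is why the question-generation step is driven by predefined cultural categories.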