The use of Deep Learning and Computer Vision in the Cultural Heritage domain has become highly relevant in recent years, with many applications such as smart audio guides, interactive museums, and augmented reality. All these technologies require large amounts of data to work effectively and be useful to the user. In the context of artworks, such data is annotated by experts in an expensive and time-consuming process. In particular, for each artwork, an image of the artwork and a description sheet have to be collected in order to perform common tasks such as Visual Question Answering. In this paper, we propose a method for Visual Question Answering that generates a description sheet at runtime, which can be used to answer both visual and contextual questions about the artwork, completely avoiding the image collection and annotation process. To this end, we investigate the use of GPT-3 for generating descriptions of artworks and analyze the quality of the generated descriptions through captioning metrics. Finally, we evaluate performance on Visual Question Answering and captioning tasks.