Cultural heritage applications and advanced machine learning models are creating a fruitful synergy, providing effective and accessible ways of interacting with artworks. Smart audio guides, personalized art-related content, and gamification approaches are just a few examples of how technology can add value for artists or exhibitions. Nonetheless, from a machine learning point of view, the amount of available artistic data is often insufficient to train effective models. Off-the-shelf computer vision modules can still be exploited to some extent, yet a severe domain shift exists between art images and the standard natural image datasets used to train such models, which can lead to degraded performance. This paper introduces a novel approach to address the challenges of limited annotated data and domain shift in the cultural heritage domain. By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions. This augmentation strategy enhances dataset diversity, bridging the gap between natural images and artworks and improving the alignment of visual cues with knowledge from general-purpose datasets. The generated variations assist in training vision and language models that have a deeper understanding of artistic characteristics and are able to generate better captions with appropriate jargon.