In recent years, large language models have advanced remarkably, with models such as ChatGPT demonstrating exceptional proficiency in diverse linguistic tasks. Pre-training large models with billions of parameters poses a formidable challenge, primarily due to the scarcity of datasets of a commensurate scale for effective training. Nevertheless, innovative strategies have emerged, including methods that fine-tune these pre-trained models by updating only a small set of parameters, as evidenced by models like MiniGPT-4 and LLaVA. Despite their potential in various domains, these models remain limited in their understanding of artistic imagery. They have yet to fully grasp the intricate nuances of art images or to articulate objectively the emotions those images evoke, in a manner akin to human perception. This work introduces ArtGPT-4, a pioneering large vision-language model tailored to address the deficiencies of contemporary models in artistic comprehension. ArtGPT-4 was trained on image-text pairs using a Tesla A100 device in a mere 2 hours, with a dataset comprising approximately 0.52M entries. Impressively, the model can interpret images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. Additionally, this work presents a unique dataset designed to evaluate the efficacy of vision-language models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the established benchmarks introduced in this study, lagging behind professional artists' descriptions by a negligible 0.15 points on a 6-point scale. The code and the pre-trained model are available at https://huggingface.co/Tyrannosaurus/ArtGPT-4.