In recent years, large language models have advanced remarkably, with models such as ChatGPT demonstrating exceptional proficiency in diverse linguistic tasks. Pre-training large models with billions of parameters poses a formidable challenge, primarily because datasets of a commensurate scale are scarce. Nevertheless, innovative strategies have emerged, including methods that fine-tune these pre-trained models by training only a small set of parameters, as evidenced by models like MiniGPT-4 and LLaVA. Despite their potential in various domains, these models remain limited in their understanding of artistic imagery. They have yet to fully grasp the intricate nuances of art images or to articulate objectively the emotions those images evoke, in a manner akin to human perception. This work introduces ArtGPT-4, a large vision-language model tailored to address the deficiencies of contemporary models in artistic comprehension. ArtGPT-4 was trained on image-text pairs using a Tesla A100 device in just 2 hours, with a dataset comprising approximately 0.52M entries. The model can describe images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. Additionally, this work presents a new dataset designed to evaluate the efficacy of vision-language models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the established benchmarks introduced in this study, lagging behind professional artists' descriptions by a negligible 0.15 points on a 6-point scale. The code and the pre-trained model are available at https://huggingface.co/Tyrannosaurus/ArtGPT-4.