ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter

In recent years, large language models have advanced remarkably, with models such as ChatGPT demonstrating exceptional proficiency in diverse linguistic tasks. Pre-training large models with billions of parameters poses a formidable challenge, primarily because datasets of a commensurate scale for effective training are scarce. Nevertheless, innovative strategies have emerged, including methods that fine-tune these pre-trained models using a smaller set of parameters, as evidenced by models like MiniGPT-4 and LLaVA. Despite their potential in various domains, these models remain limited in their understanding of artistic imagery. They have yet to fully grasp the intricate nuances of art images or to articulate objectively the emotions those images evoke, in a manner akin to human perception. This work introduces ArtGPT-4, a large vision-language model tailored to address the deficiencies of contemporary models in artistic comprehension. ArtGPT-4 was trained on image-text pairs using a Tesla A100 device in just 2 hours, with a dataset comprising approximately 0.52M entries. The model can describe images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. Additionally, this work presents a new dataset designed to evaluate the efficacy of vision-language models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the benchmarks introduced in this study, trailing professional artists' descriptions by only 0.15 points on a 6-point scale. The code and the pre-trained model are available at https://huggingface.co/Tyrannosaurus/ArtGPT-4.

