ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter

In recent years, large language models have advanced remarkably, with models such as ChatGPT demonstrating exceptional proficiency in diverse linguistic tasks. Pre-training large models with billions of parameters poses a formidable challenge, primarily because datasets of a commensurate scale are scarce. Nevertheless, innovative strategies have emerged, including methods that fine-tune these pre-trained models by updating only a small set of parameters, as evidenced by models like MiniGPT-4 and LLaVA. Despite their potential in various domains, these models remain limited in their understanding of artistic imagery. They have yet to fully grasp the intricate nuances of art images or to articulate objectively the emotions those images evoke, in a manner akin to human perception. This work introduces ArtGPT-4, a pioneering large vision-language model tailored to address the deficiencies of contemporary models in artistic comprehension. ArtGPT-4 was trained on image-text pairs using a Tesla A100 device in just 2 hours, with a dataset comprising approximately 0.52M entries. Impressively, the model can interpret images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. Additionally, this work presents a unique dataset designed to evaluate the efficacy of vision-language models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the benchmarks introduced in this study, trailing professional artists' descriptions by only 0.15 points on a 6-point scale. The code and the pre-trained model are available at https://huggingface.co/Tyrannosaurus/ArtGPT-4.

