In recent years, large language models (LLMs) have made significant progress in natural language processing (NLP), with models such as ChatGPT and GPT-4 achieving impressive capabilities across a variety of linguistic tasks. However, training models at this scale is challenging, and datasets that match the model's scale are often difficult to find. Fine-tuning, and training models with fewer parameters using novel methods, have emerged as promising approaches to overcoming these challenges. One such model is MiniGPT-4, which achieves vision-language understanding comparable to GPT-4 by leveraging novel pre-training models and innovative training strategies. However, MiniGPT-4 still faces challenges in image understanding, particularly for artistic pictures. To address these limitations, we propose ArtGPT-4, a novel multimodal model. ArtGPT-4 was trained on image-text pairs using a Tesla A100 device in just 2 hours, with only about 200 GB of data. The model can depict images with an artistic flair and generate visual code, including aesthetically pleasing HTML/CSS web pages. Furthermore, we propose novel benchmarks for evaluating the performance of vision-language models. Under these evaluation methods, ArtGPT-4 scored more than 1 point higher than the current \textbf{state-of-the-art} model and was only 0.25 points lower than artists on a 6-point scale. Our code and pre-trained model are available at \url{https://huggingface.co/Tyrannosaurus/ArtGPT-4}.