This paper presents a novel approach to human image colorization by fine-tuning the InstructPix2Pix model, which leverages a language model (GPT-3) and a text-to-image model (Stable Diffusion) to generate its training data. Although the original InstructPix2Pix model is proficient at editing images according to textual instructions, it is limited in the narrower domain of colorization. To address this, we fine-tuned the model on the IMDB-WIKI dataset, pairing black-and-white images with a diverse set of colorization prompts generated by ChatGPT. Our contributions are (1) applying fine-tuning techniques to Stable Diffusion models specifically for colorization and (2) using generative models to create varied conditioning prompts. After fine-tuning, our model quantitatively outperforms the original InstructPix2Pix model on multiple metrics and qualitatively produces more realistically colored images. The code for this project is available in the GitHub repository https://github.com/AllenAnZifeng/DeepLearning282.
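The following sketch is a rough illustration of how black-and-white images and prompts might be paired into InstructPix2Pix-style training examples (input image, edit instruction, target image). It converts a color image to grayscale using the ITU-R BT.601 luma weights and attaches a randomly sampled prompt. Several details are assumptions, not the authors' pipeline: the example prompts, the `Triplet` structure, the luma weights, and the idea of deriving the grayscale input from a color original. Images are plain nested lists of RGB tuples so the example needs no dependencies.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]
Image = List[List[RGB]]  # rows of (R, G, B) pixels with values in 0..255

# Hypothetical stand-ins for ChatGPT-generated prompts (not the authors' actual list).
EXAMPLE_PROMPTS = [
    "Colorize this black-and-white photo.",
    "Add realistic colors to this portrait.",
    "Turn this grayscale image into a natural color photograph.",
]


@dataclass
class Triplet:
    source: Image      # grayscale input, replicated across 3 channels
    instruction: str   # text prompt that conditions the edit
    target: Image      # original color image the model should reproduce


def to_grayscale(img: Image) -> Image:
    """Convert RGB to BT.601 luma, repeated in all 3 channels (assumed input format)."""
    out: Image = []
    for row in img:
        new_row: List[RGB] = []
        for r, g, b in row:
            y = min(255, round(0.299 * r + 0.587 * g + 0.114 * b))
            new_row.append((y, y, y))
        out.append(new_row)
    return out


def make_triplet(color_img: Image, prompts: Sequence[str], rng: random.Random) -> Triplet:
    """Pair the grayscale version of an image with a randomly chosen prompt."""
    return Triplet(
        source=to_grayscale(color_img),
        instruction=rng.choice(prompts),
        target=color_img,
    )


if __name__ == "__main__":
    img = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
    t = make_triplet(img, EXAMPLE_PROMPTS, random.Random(0))
    print(t.instruction)
    print(t.source)
```

Sampling a different prompt per example, rather than using one fixed instruction, reflects the abstract's emphasis on varied conditioning prompts; how the real prompts were sampled during training is not specified here.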