Recent data-driven image colorization methods enable automatic or reference-based colorization, but they still offer unsatisfactory and inaccurate object-level color control. To address these issues, we propose DiffColor, a new method that leverages pre-trained diffusion models to recover vivid colors conditioned on a text prompt, without any additional inputs. DiffColor consists of two main stages: colorization with a generative color prior and in-context controllable colorization. Specifically, we first fine-tune a pre-trained text-to-image model with a CLIP-based contrastive loss so that it generates colorized images. We then optimize a text embedding that aligns the colorized image with the text prompt, and fine-tune the diffusion model to enable high-quality image reconstruction. Our method produces vivid and diverse colors within a few iterations and keeps the structure and background intact while aligning the colors closely with the target language guidance. Moreover, our method supports in-context colorization, i.e., producing different colorization results by modifying the text prompt without any further fine-tuning, and it achieves object-level controllable colorization. Extensive experiments and user studies demonstrate that DiffColor outperforms previous works in terms of visual quality, color fidelity, and diversity of colorization options.
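The abstract does not give the form of the CLIP-based contrastive loss, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss over cosine similarities. It assumes the image and the candidate prompts have already been embedded; the toy vectors stand in for real CLIP features, and the names `cosine`, `contrastive_loss`, and `temperature` are hypothetical rather than taken from DiffColor.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_emb, text_embs, positive_index, temperature=0.07):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    Pulls the image embedding toward text_embs[positive_index] and pushes it
    away from the other text embeddings, which act as negatives.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    # Numerically stable log-sum-exp over all candidate prompts.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denominator - logits[positive_index]


# Toy embeddings standing in for CLIP features (illustrative values only).
image_emb = [1.0, 0.0]   # e.g. the colorized output
text_embs = [
    [1.0, 0.0],          # e.g. a prompt describing a colorful photo
    [0.0, 1.0],          # e.g. a prompt describing a grayscale photo
]

loss_matched = contrastive_loss(image_emb, text_embs, positive_index=0)
loss_mismatched = contrastive_loss(image_emb, text_embs, positive_index=1)
```

In this sketch the loss is near zero when the image embedding matches the positive prompt and grows large when it matches a negative one, which is the behavior a fine-tuning objective of this kind would rely on.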