Colorizing grayscale images is a complex and subjective task. Despite recent progress from training deep neural networks on large-scale datasets, difficulties with controllability and visual quality persist. To address these issues, we present a novel image colorization framework that combines image diffusion techniques with granular text prompts. This integration not only produces semantically appropriate colorizations but also greatly improves the control users have over the colorization process. Our method balances automation and control, and outperforms existing techniques in visual quality and semantic coherence. We leverage a pretrained generative diffusion model and show that it can be finetuned for colorization without losing its generative power or its attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable vividness level for the semantics of a given scene. Our approach holds particular promise for color enhancement and historical image colorization.