Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions. Previous methods implicitly assume that users provide comprehensive color descriptions for most of the objects in the image, an assumption that leads to suboptimal performance. In this paper, we propose a unified model that performs language-based colorization with any-level descriptions. We leverage a pretrained cross-modality generative model, with its robust language understanding and rich color priors, to handle the inherent ambiguity of any-level descriptions. We further design modules that align with the input conditions to preserve local spatial structures and prevent ghosting effects. With a novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios. Extensive experimental results demonstrate that our method handles any-level descriptions effectively and outperforms both language-based and automatic colorization methods. The code and pretrained models are available at: https://github.com/changzheng123/L-CAD