We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT, that explores the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel-level image enhancement. We show that the open-world CLIP prior aids not only in distinguishing between backlit and well-lit images, but also in perceiving heterogeneous regions with different luminance, which facilitates the optimization of the enhancement network. Unlike in high-level and image manipulation tasks, directly applying CLIP to enhancement tasks is non-trivial, owing to the difficulty of finding accurate prompts. To solve this issue, we devise a prompt learning framework that first learns an initial prompt pair by constraining the text-image similarity between the prompt (negative/positive sample) and the corresponding image (backlit image/well-lit image) in the CLIP latent space. Then, we train the enhancement network based on the text-image similarity between the enhanced result and the initial prompt pair. To further improve the accuracy of the initial prompt pair, we iteratively fine-tune the prompt learning framework via rank learning to reduce the distribution gaps between the backlit images, enhanced results, and well-lit images, which in turn boosts enhancement performance. Our method alternates between updating the prompt learning framework and the enhancement network until visually pleasing results are achieved. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability, without requiring any paired data.
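The three similarity-based objectives described above (prompt initialization, enhancement training, and rank-based prompt refinement) can be sketched on toy embeddings. This is a minimal illustration, not the authors' implementation: the abstract does not give the loss formulas, so the softmax over cosine similarities, the binary cross-entropy form, the margin ranking form, the `margin` value, and all function names here are assumptions. Real use would replace the hand-written vectors with CLIP image and text embeddings and optimize the prompts and network by gradient descent.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def positive_prob(image_emb, pos_prompt, neg_prompt):
    """Softmax over the two text-image similarities (assumed form).

    Returns the probability mass assigned to the positive (well-lit) prompt.
    """
    s_pos = cosine(image_emb, pos_prompt)
    s_neg = cosine(image_emb, neg_prompt)
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


def prompt_init_loss(image_emb, is_well_lit, pos_prompt, neg_prompt):
    """Binary cross-entropy that ties each image to its prompt (assumed form)."""
    p = positive_prob(image_emb, pos_prompt, neg_prompt)
    return -math.log(p) if is_well_lit else -math.log(1.0 - p)


def enhancement_loss(enhanced_emb, pos_prompt, neg_prompt):
    """Probability that the enhanced result still matches the negative prompt.

    Minimizing it pushes the enhanced image toward the positive prompt
    (assumed form; the prompts are held fixed during this step).
    """
    return 1.0 - positive_prob(enhanced_emb, pos_prompt, neg_prompt)


def rank_loss(backlit_emb, enhanced_emb, well_lit_emb,
              pos_prompt, neg_prompt, margin=0.1):
    """Margin ranking loss enforcing backlit < enhanced < well-lit (assumed form)."""
    s_b = positive_prob(backlit_emb, pos_prompt, neg_prompt)
    s_e = positive_prob(enhanced_emb, pos_prompt, neg_prompt)
    s_w = positive_prob(well_lit_emb, pos_prompt, neg_prompt)
    return max(0.0, s_b - s_e + margin) + max(0.0, s_e - s_w + margin)


if __name__ == "__main__":
    # Hypothetical 2-D stand-ins for CLIP embeddings.
    pos, neg = [1.0, 0.0], [0.0, 1.0]
    backlit, enhanced, well_lit = [0.1, 0.9], [0.6, 0.4], [0.9, 0.1]
    print("init loss (well-lit):", prompt_init_loss(well_lit, True, pos, neg))
    print("enhancement loss:", enhancement_loss(enhanced, pos, neg))
    print("rank loss:", rank_loss(backlit, enhanced, well_lit, pos, neg))
```

In the alternating scheme, one would minimize `prompt_init_loss` (and later `rank_loss`) with respect to the prompt embeddings, then minimize `enhancement_loss` with respect to the enhancement network, and repeat.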