RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. The learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that lack a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from the training data, we compute a residual vector in the embedding space as the simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further reduces training time dramatically, stabilizes training, and produces high-quality enhanced images without artifacts in both supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in the training data and thereby enabling potential bias correction.
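
The residual vector described above, the difference between the mean embeddings of well-lit and backlit images, can be illustrated with a minimal sketch. Everything here beyond that definition is an assumption made for illustration: the embeddings are synthetic NumPy arrays standing in for real CLIP image embeddings, the function names `residual_vector` and `guidance_loss` are hypothetical, and the negative cosine similarity used as a guidance signal is only one plausible choice, not necessarily the loss used in the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def residual_vector(well_lit_emb, backlit_emb):
    """Mean well-lit embedding minus mean backlit embedding.

    Both inputs have shape (num_images, embedding_dim).
    """
    return well_lit_emb.mean(axis=0) - backlit_emb.mean(axis=0)


def guidance_loss(image_emb, residual):
    """ASSUMPTION: negative mean cosine similarity to the residual vector.

    Lower values mean the embeddings point more strongly in the
    "well-lit" direction. The paper's actual loss may differ.
    """
    sim = l2_normalize(image_emb) @ l2_normalize(residual)
    return float(-sim.mean())


# Synthetic stand-ins for CLIP image embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
shift = np.zeros(dim)
shift[0] = 1.0  # pretend axis 0 encodes "well-lit"

backlit = l2_normalize(rng.normal(size=(100, dim)))
well_lit = l2_normalize(rng.normal(size=(100, dim)) + 3.0 * shift)

v = residual_vector(well_lit, backlit)
loss_backlit = guidance_loss(backlit, v)
loss_well_lit = guidance_loss(well_lit, v)
```

In this toy setup the residual vector is dominated by the axis along which the two sets differ, and the assumed loss is lower for the well-lit embeddings than for the backlit ones. In practice the arrays would come from a CLIP image encoder applied to the training images, and the loss would be evaluated on the embedding of the enhancement network's output.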
