RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

In this paper, we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. The learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional image encoders that have no accompanying text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from the training data, we compute a residual vector in the embedding space as the simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach dramatically reduces training time even further, stabilizes training, and produces high-quality enhanced images without artifacts in both supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in the training data and thereby enabling potential bias correction.
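The residual vector described above can be illustrated with a minimal sketch. It assumes the image embeddings are already available as arrays; random synthetic vectors stand in for real CLIP outputs here. The function names and the cosine-based guidance score are illustrative assumptions, not the loss used in the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def residual_vector(well_lit_emb, backlit_emb):
    """Difference between the mean well-lit and mean backlit embeddings."""
    return well_lit_emb.mean(axis=0) - backlit_emb.mean(axis=0)

def guidance_loss(image_emb, residual):
    """Hypothetical guidance score (an assumption, not the paper's loss):
    negative mean cosine similarity between embeddings and the residual."""
    sims = l2_normalize(image_emb) @ l2_normalize(residual)
    return float(-sims.mean())

# Synthetic stand-ins for CLIP image embeddings (assumed dimension 512).
rng = np.random.default_rng(0)
dim = 512
shift = np.zeros(dim)
shift[0] = 30.0  # artificial "well-lit" direction along axis 0

backlit = l2_normalize(rng.normal(size=(100, dim)))
well_lit = l2_normalize(rng.normal(size=(100, dim)) + shift)

r = residual_vector(well_lit, backlit)
loss_backlit = guidance_loss(backlit, r)
loss_well_lit = guidance_loss(well_lit, r)
```

In this toy setup the residual points along the artificial shift axis, and embeddings from the well-lit set receive a lower loss than those from the backlit set. An enhancement network trained against such a score would be pushed in the same direction.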
