Recent image tone adjustment (or enhancement) approaches have predominantly adopted supervised learning to learn human-centric perceptual assessment. However, these approaches are constrained by intrinsic challenges of supervised learning. First, the need for expertly curated or retouched images raises the cost of data acquisition. Second, the target styles they cover are confined to the stylistic variants that can be inferred from the training data. To overcome these challenges, we propose CLIPtone, an unsupervised learning-based method for text-based image tone adjustment that extends an existing image enhancement method to accommodate natural language descriptions. Specifically, we design a hyper-network that adaptively modulates the pretrained parameters of the backbone model based on a text description. To assess whether the adjusted image aligns with the text description without a ground-truth image, we utilize CLIP, which is trained on a vast set of language-image pairs and thus encompasses knowledge of human perception. The major advantages of our approach are threefold: (i) minimal data collection expenses, (ii) support for a range of adjustments, and (iii) the ability to handle novel text descriptions unseen in training. The efficacy of our approach is demonstrated through comprehensive experiments, including a user study.
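The two mechanisms named above, text-conditioned modulation of pretrained backbone parameters and an embedding-based alignment score, can be illustrated with a deliberately tiny sketch. It is not the CLIPtone implementation. The backbone here is assumed to be a three-parameter global tone curve, the hyper-network is assumed to be a single linear map, and the embeddings are hand-written stand-ins for what CLIP's text and image encoders would produce. All function names and numeric values are hypothetical.

```python
import math


def modulate_params(pretrained, weights, text_emb):
    """Hyper-network stand-in (assumption: one linear layer).

    Maps the text embedding to one offset per backbone parameter and
    adds the offsets to the pretrained values.
    """
    deltas = [sum(w * e for w, e in zip(row, text_emb)) for row in weights]
    return [p + d for p, d in zip(pretrained, deltas)]


def apply_tone(pixels, params):
    """Toy backbone (assumption): gain * x**gamma + bias, clamped to [0, 1]."""
    gain, gamma, bias = params
    return [min(1.0, max(0.0, gain * (x ** gamma) + bias)) for x in pixels]


def cosine_similarity(a, b):
    """Alignment score between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical values for illustration only.
pretrained = [1.0, 1.0, 0.0]                      # identity tone curve
weights = [[0.5, 0.0], [0.0, -0.5], [0.1, 0.0]]   # 3 parameters x 2-dim embedding
text_emb = [1.0, 0.0]                             # stand-in for a CLIP text embedding
pixels = [0.0, 0.25, 0.5, 1.0]                    # intensities in [0, 1]

params = modulate_params(pretrained, weights, text_emb)
adjusted = apply_tone(pixels, params)

# Stand-in for a CLIP image embedding of the adjusted image.
image_emb = [0.8, 0.6]
score = cosine_similarity(image_emb, text_emb)
print(params, adjusted, score)
```

With a zero text embedding the offsets vanish and the backbone falls back to its pretrained behavior. This is one reason to modulate pretrained parameters rather than predict them from scratch. In an unsupervised setup of the kind described, a score like `score` would drive the training signal in place of a ground-truth image.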