Existing techniques for image-to-image translation commonly suffer from two critical problems: heavy reliance on per-sample domain annotation and/or an inability to handle multiple attributes per image. Recent truly-unsupervised methods adopt clustering approaches to provide per-sample one-hot domain labels easily. However, they cannot account for the real-world setting in which one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled with human understanding. To overcome these problems, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes given as text for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels, so that users can specify the target domain with a set of attributes in language. To account for the case in which the initial prompts are inaccurate, we also present prompt learning. We further present a domain regularization loss that enforces that translated images are mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models.
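The idea that image-attribute similarity indicates a multi-hot domain label can be illustrated with a minimal sketch. The abstract does not specify how similarities are turned into labels, so the top-k selection rule, the function names (`cosine`, `multi_hot_label`), the `top_k` parameter, and the toy three-dimensional embeddings below are all assumptions made for illustration, not the authors' implementation. In practice the embeddings would presumably come from a pretrained vision-language encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def multi_hot_label(image_emb, attribute_embs, top_k=2):
    """Mark the top_k most similar attributes as active (assumed rule)."""
    sims = [cosine(image_emb, t) for t in attribute_embs]
    # Indices of attributes sorted by decreasing similarity to the image.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    label = [0] * len(sims)
    for i in ranked[:top_k]:
        label[i] = 1
    return label, sims

# Hypothetical candidate attributes with toy embeddings (one axis each).
attributes = ["blond hair", "smiling", "eyeglasses"]
attribute_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy image embedding that leans toward the first two attributes.
image_emb = [0.8, 0.6, 0.1]

label, sims = multi_hot_label(image_emb, attribute_embs, top_k=2)
selected = [name for name, on in zip(attributes, label) if on]
print(label, selected)
```

With `top_k=1` the same function reduces to the one-hot labeling that clustering-based methods provide, which shows how the multi-hot formulation generalizes it.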