Existing techniques for image-to-image translation have commonly suffered from two critical problems: heavy reliance on per-sample domain annotation and/or an inability to handle multiple attributes per image. Recent truly-unsupervised methods adopt clustering approaches to easily provide per-sample one-hot domain labels. However, they cannot account for the real-world setting in which one sample may have multiple attributes. In addition, the semantics of the clusters are not easily aligned with human understanding. To overcome these limitations, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes, given as text, for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels, so that users can specify the target domain with a set of attributes in language. To account for the case in which the initial prompts are inaccurate, we also present prompt learning. We further present a domain regularization loss that enforces that translated images are mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models.
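The idea that image-attribute similarity indicates per-sample multi-hot domain labels can be sketched as follows. This is only an illustrative sketch, not the LANIT implementation: it assumes that image and attribute embeddings have already been produced by some pretrained vision-language encoder, and the function name `multi_hot_labels`, the top-`k` selection rule, and the toy attribute names and embeddings are all hypothetical choices made for the example.

```python
import numpy as np


def multi_hot_labels(image_emb, attr_emb, k=2):
    """Mark the k attributes most similar to each image in a multi-hot vector.

    image_emb: (num_images, dim) array of image embeddings (assumed given).
    attr_emb:  (num_attributes, dim) array of attribute text embeddings (assumed given).
    k:         number of attributes kept per image (an assumption of this sketch).
    """
    # L2-normalise so that the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    att = attr_emb / np.linalg.norm(attr_emb, axis=1, keepdims=True)
    sim = img @ att.T  # shape: (num_images, num_attributes)

    # Indices of the k highest-similarity attributes for each image.
    topk = np.argsort(-sim, axis=1)[:, :k]

    # Set those positions to 1; everything else stays 0.
    labels = np.zeros_like(sim, dtype=np.int64)
    np.put_along_axis(labels, topk, 1, axis=1)
    return labels, sim


# Toy, hand-made embeddings standing in for real encoder outputs.
attributes = ["blond hair", "smiling", "eyeglasses"]  # hypothetical candidates
attr_emb = np.eye(3)
image_emb = np.array([[0.9, 0.8, 0.1],
                      [0.1, 0.2, 0.9]])

labels, sim = multi_hot_labels(image_emb, attr_emb, k=2)
for row in labels:
    print([a for a, on in zip(attributes, row) if on])
```

Unlike a one-hot cluster assignment, each row of `labels` can switch on several attributes at once, and a target domain can be specified in the same way by choosing a set of attributes.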