Automatic image editing is in high demand because of its many applications, and natural language instructions are essential for flexible, intuitive editing that matches what the user imagines. StyleCLIP, a pioneering work in text-driven image editing, finds an edit direction in the CLIP space and then edits the image by mapping that direction to the StyleGAN space. However, it is difficult to tune the inputs it requires beyond the original image and the text instruction. In this study, we propose a method that constructs the edit direction adaptively in the StyleGAN and CLIP spaces using a support vector machine (SVM). Our model represents the edit direction as the normal vector, in the CLIP space, of the decision boundary obtained by training an SVM to classify positive and negative images. These images are retrieved from a large-scale image corpus, originally used for pre-training StyleGAN, according to the CLIP similarity between each image and the text instruction. We confirmed that our model performs as well as the StyleCLIP baseline while requiring only simple inputs and without increasing the computational time.
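The retrieval and SVM steps described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper. The embeddings are synthetic random vectors standing in for real CLIP image and text embeddings. The function names `unit`, `retrieve`, and `svm_normal`, the choice of `k`, and all hyperparameters are hypothetical. The linear SVM is fitted with a simple hand-written subgradient descent on the hinge loss, because the abstract does not say which solver is used. Mapping the resulting direction into the StyleGAN space is omitted.

```python
import numpy as np


def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(image_emb, text_emb, k):
    """Return indices of the k most and k least text-similar images (cosine)."""
    sims = unit(image_emb) @ unit(text_emb)
    order = np.argsort(sims)  # ascending similarity
    return order[-k:], order[:k]


def svm_normal(pos, neg, lam=1e-2, lr=0.1, epochs=200):
    """Fit a linear SVM by full-batch subgradient descent on the
    L2-regularized hinge loss and return the unit normal vector.
    Illustrative solver only; hyperparameters are assumptions."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1  # samples violating the margin
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(X)
        grad_b = -y[active].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return unit(w)


# Synthetic stand-ins for CLIP embeddings (assumption: real ones would be
# produced by a CLIP image encoder and text encoder).
rng = np.random.default_rng(0)
dim = 32
text_emb = unit(rng.normal(size=dim))
image_emb = unit(rng.normal(size=(500, dim)))

pos_idx, neg_idx = retrieve(image_emb, text_emb, k=50)
direction = svm_normal(image_emb[pos_idx], image_emb[neg_idx])

# The normal vector should point toward the text embedding.
print("cosine(direction, text):", float(direction @ text_emb))
```

In practice a library solver such as scikit-learn's `LinearSVC` could replace `svm_normal`. The hand-written version is used here only to keep the sketch dependent on NumPy alone.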