Text-based style transfer is an emerging research topic that uses text instead of a style image to guide the transfer process, significantly extending the application scenarios of style transfer. However, previous methods require either extra optimization time or text-image paired data, which limits their effectiveness. In this work, we propose a data-efficient text-based style transfer method that requires no optimization at the inference stage. Specifically, we map the text input into the style space of a pre-trained VGG network to realize a more effective style swap. We also leverage CLIP's multi-modal embedding space to learn the text-to-style mapping using only an image dataset. Our method can transfer arbitrary new styles specified by text input in real time and synthesize high-quality artistic images.
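The idea of mapping an embedding into a feature-statistics style space and then swapping styles can be illustrated with a toy sketch. This is not the method's actual implementation: it assumes, purely for illustration, that a "style" is the per-channel mean and standard deviation of a feature map (an AdaIN-like stand-in for the style swap), that the mapper is a single linear layer, and that the small lists below stand in for VGG features and a CLIP text embedding. The names `channel_stats`, `embedding_to_style`, and `apply_style` are hypothetical.

```python
import math

def channel_stats(feat):
    """Per-channel (mean, std) of a feature map stored as feat[channel][position]."""
    stats = []
    for ch in feat:
        mean = sum(ch) / len(ch)
        var = sum((v - mean) ** 2 for v in ch) / len(ch)
        stats.append((mean, math.sqrt(var + 1e-8)))
    return stats

def embedding_to_style(embedding, w_mean, w_std):
    """Hypothetical linear mapper from a joint-space embedding to per-channel style stats.

    Assumption: during training it would receive image embeddings and be fit to the
    feature statistics of the same image; at inference a text embedding from the
    same joint space is passed instead, so no paired text-image data is needed.
    """
    stats = []
    for row_m, row_s in zip(w_mean, w_std):
        mean = sum(w * e for w, e in zip(row_m, embedding))
        raw = sum(w * e for w, e in zip(row_s, embedding))
        std = math.log1p(math.exp(raw))  # softplus keeps the predicted std positive
        stats.append((mean, std))
    return stats

def apply_style(content_feat, style_stats):
    """Normalize each content channel, then rescale it to the target mean and std."""
    out = []
    content_stats = channel_stats(content_feat)
    for ch, (c_mean, c_std), (s_mean, s_std) in zip(content_feat, content_stats, style_stats):
        out.append([(v - c_mean) / c_std * s_std + s_mean for v in ch])
    return out

# Toy data: 2 channels x 4 positions, and a 3-dimensional stand-in for a text embedding.
content = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.5, 6.0, 6.5]]
text_embedding = [0.2, -0.1, 0.4]
w_mean = [[1.0, 0.0, 0.5], [0.0, 2.0, 1.0]]   # illustrative weights, not learned
w_std = [[0.5, 0.5, 0.5], [1.0, -1.0, 0.0]]   # illustrative weights, not learned

target = embedding_to_style(text_embedding, w_mean, w_std)
stylized = apply_style(content, target)
```

Because the whole path is a single forward pass (embed, map, swap), nothing is optimized per input, which is the property the abstract refers to as requiring no optimization at inference. In a real system the stylized features would then be passed through a decoder to produce an image, a step omitted here.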