Image style transfer plays an important role in both computer graphics and computer vision. However, most existing methods rely on a reference style image and cannot stylize specific objects individually. To overcome this limitation, we propose Soulstyler, a framework that lets users guide the stylization of specific objects in an image through simple textual descriptions. We use a large language model to parse the text and identify both the stylization targets and the desired styles. Combined with a CLIP-based semantic visual embedding encoder, the model understands and matches text and image content. We also introduce a novel localized text-image block matching loss that restricts style transfer to the specified target objects, while non-target regions keep their original style. Experimental results demonstrate that our model accurately performs style transfer on target objects according to textual descriptions without affecting the style of background regions. Our code will be available at https://github.com/yisuanwang/Soulstyler.
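To make the idea of a localized text-image block matching loss more concrete, the following is a minimal sketch and not the formulation used in Soulstyler, which the abstract does not specify. It assumes that each image block already has an embedding-space "direction" (stylized block embedding minus source block embedding), that the style text has a corresponding direction, and that a mask marks which blocks belong to the target object. The function names `cosine` and `localized_patch_loss`, the argument names, and the choice of a cosine-based directional loss are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localized_patch_loss(patch_dirs, text_dir, in_target):
    """Hypothetical masked directional loss.

    patch_dirs: one embedding-space direction per image block
                (assumed: stylized embedding minus source embedding).
    text_dir:   embedding-space direction of the style text (assumed).
    in_target:  one bool per block, True if the block lies on the target object.

    Only blocks on the target object contribute, so blocks in non-target
    regions receive no stylization signal from this term.
    """
    losses = [
        1.0 - cosine(direction, text_dir)
        for direction, keep in zip(patch_dirs, in_target)
        if keep
    ]
    if not losses:
        return 0.0  # no target blocks selected: nothing to stylize
    return sum(losses) / len(losses)
```

In this sketch, a block whose direction aligns with the text direction contributes zero loss, and blocks outside the mask are ignored entirely. A complete system would presumably pair such a term with a content-preservation term for the non-target regions.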