Image style transfer occupies an important place in both computer graphics and computer vision. However, most current methods require a reference style image and cannot stylize specific objects individually. To overcome this limitation, we propose the "Soulstyler" framework, which lets users guide the stylization of specific objects in an image through simple textual descriptions. We introduce a large language model that parses the text and identifies both the stylization target and the desired style. Combined with a CLIP-based semantic visual embedding encoder, the model understands and matches text and image content. We also introduce a novel localized text-image block matching loss that restricts style transfer to the specified target objects, while non-target regions retain their original style. Experimental results demonstrate that our model accurately performs style transfer on target objects according to textual descriptions without affecting the style of background regions. Our code will be available at https://github.com/yisuanwang/Soulstyler.
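The idea behind a localized text-image block matching loss can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: the sketch uses a CLIP-style directional loss (one minus the cosine similarity between the image-embedding change and the text-embedding change), evaluated per patch and averaged only over patches that overlap the target mask. The function name `localized_patch_loss`, the parameter `cov_threshold`, and the use of precomputed embedding arrays in place of a real CLIP encoder are all hypothetical.

```python
import numpy as np


def _unit(v, eps=1e-8):
    """L2-normalize along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def localized_patch_loss(src_patch_emb, out_patch_emb,
                         src_text_emb, style_text_emb,
                         patch_mask_cov, cov_threshold=0.5):
    """Directional loss restricted to patches on the target object.

    Assumed inputs (stand-ins for real encoder outputs):
      src_patch_emb, out_patch_emb: (N, D) embeddings of N patches cropped
          from the content image and from the stylized output.
      src_text_emb, style_text_emb: (D,) embeddings of a neutral prompt
          and of the style prompt.
      patch_mask_cov: (N,) fraction of each patch inside the target mask.
    """
    # Direction in embedding space from neutral text to style text.
    text_dir = _unit(style_text_emb - src_text_emb)
    # Direction each patch moved between content and output.
    img_dir = _unit(out_patch_emb - src_patch_emb)
    # 1 - cosine similarity per patch: 0 when aligned, 2 when opposite.
    per_patch = 1.0 - img_dir @ text_dir
    # Only patches that mostly cover the target object contribute.
    selected = patch_mask_cov >= cov_threshold
    if not selected.any():
        return 0.0
    return float(per_patch[selected].mean())
```

Because patches below the coverage threshold are excluded from the average, they contribute nothing to this term, which is one way a loss can leave non-target regions unconstrained by the style prompt. Keeping the background unchanged would still need a separate content-preservation term, which this sketch omits.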