Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most operate at the global level and overlook region-level and even pixel-level semantic correspondence. To address this, we propose CoCoDiff, a novel training-free, low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We observe that correspondence cues within generative diffusion models remain under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. A cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object- and region-level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
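To make the dense alignment idea concrete, the following is a minimal sketch, not the authors' implementation: it matches each content pixel to its most similar style pixel by cosine similarity over intermediate diffusion features. The tensor names, the choice of feature layer and timestep, and the argmax matching rule are all illustrative assumptions.

```python
# Minimal sketch: dense correspondence from intermediate diffusion features.
# `feat_content` and `feat_style` are hypothetical (C, H, W) feature maps taken
# from an intermediate UNet layer of a pretrained latent diffusion model; the
# exact layer and timestep are assumptions, not the paper's specification.
import torch
import torch.nn.functional as F

def dense_correspondence(feat_content: torch.Tensor,
                         feat_style: torch.Tensor) -> torch.Tensor:
    """Return, for each content pixel, the index of its best-matching style pixel."""
    C, H, W = feat_content.shape
    # Flatten spatial dimensions and L2-normalize so dot products give cosine similarity.
    fc = F.normalize(feat_content.reshape(C, H * W), dim=0)   # (C, N)
    fs = F.normalize(feat_style.reshape(C, H * W), dim=0)     # (C, N)
    sim = fc.t() @ fs                                          # (N, N) pairwise similarity
    match = sim.argmax(dim=1)                                  # best style pixel per content pixel
    return match.reshape(H, W)
```

In practice such a map would be computed at the feature resolution of the chosen layer and upsampled to pixel resolution; soft (attention-style) matching rather than hard argmax is an equally plausible design choice.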