A visual anagram is an intriguing form of art in which a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet it still suffers from several key limitations, including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing the intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the state-of-the-art parallel denoising algorithm from pixel-based T2I models to an adversarially distilled latent-based model, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the resulting visual degradation. At the core of our approach, the S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; (iii) attention-guided noise fusion. Building upon these components, our method, dubbed S2CO-Anagram, generates higher-resolution anagram images with noticeably better visual harmony and semantic faithfulness than related state-of-the-art (SOTA) approaches, while also achieving substantially faster inference. Code will be publicly available.
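For readers unfamiliar with parallel denoising, the following is a minimal sketch of the baseline idea as it is commonly described in prior visual-anagram work: at each step, the noisy image is transformed into every view, a noise estimate is predicted for each view under its own prompt, each estimate is mapped back through the inverse transform, and the results are averaged. It is not the S2CO method; in particular, the plain average below is only a stand-in for the attention-guided noise fusion named in the abstract. The names `fuse_noise`, `toy_predictor`, and the prompt strings are hypothetical, and the toy predictor replaces a real diffusion model purely so the example runs.

```python
import numpy as np

def identity(x):
    """View 1: the image as-is."""
    return x

def rot180(x):
    """View 2: 180-degree rotation over the last two (spatial) axes.
    This transform is its own inverse."""
    return np.rot90(x, 2, axes=(-2, -1))

def fuse_noise(x_t, prompts, views, predict_noise):
    """Average per-view noise estimates after undoing each view transform.

    views is a list of (transform, inverse_transform) pairs, one per prompt.
    predict_noise(x, prompt) is an assumed interface for the noise predictor.
    """
    estimates = []
    for prompt, (view, inverse) in zip(prompts, views):
        eps = predict_noise(view(x_t), prompt)  # estimate in the view's frame
        estimates.append(inverse(eps))          # map back to the base frame
    return np.mean(estimates, axis=0)

def toy_predictor(x, prompt):
    """Hypothetical stand-in for a diffusion model: scales the input by the
    prompt length so the fused result is easy to verify by hand."""
    return len(prompt) * x

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))            # channels x height x width
prompts = ["cat", "horse"]                      # one prompt per view
views = [(identity, identity), (rot180, rot180)]
fused = fuse_noise(x_t, prompts, views, toy_predictor)
```

With the toy predictor, each estimate maps back to a multiple of `x_t` (3 and 5 times, respectively), so the fused estimate is their mean. A real pipeline would pass the fused estimate to the sampler's update step and would use a learned model in place of `toy_predictor`.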