Visual anagrams are images that change appearance under a transformation, such as flipping or rotation. With the advent of diffusion models, such optical illusions can be generated by averaging noise predictions across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where the concepts of different views are generated independently and the result cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast visual anagram generation as a multi-task learning problem, where different viewpoint prompts are analogous to different tasks, and derive denoising trajectories that align well across all tasks simultaneously. At the core of our framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap between the cross-attention maps of different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, prompting us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method's superior ability to generate visual anagrams spanning diverse concepts.
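The variance issue behind the rectification step can be illustrated numerically. If the per-view noise predictions behaved like independent unit-variance Gaussians (a simplifying assumption for this sketch; the paper's actual estimator and rectification may differ), a plain average over K views would shrink the per-element variance to 1/K, and rescaling by sqrt(K) would restore it:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 2, 1_000_000  # number of views, elements per noise map

# Stand-ins for per-view noise predictions, assumed i.i.d. ~ N(0, 1) per element.
preds = rng.standard_normal((K, n))

naive = preds.mean(axis=0)      # plain average: per-element variance collapses to ~1/K
rectified = np.sqrt(K) * naive  # rescale to restore unit variance

print(naive.var())      # ≈ 1/K
print(rectified.var())  # ≈ 1.0
```

In practice the view-wise predictions are correlated, so the true variance shrinkage lies between 1/K and 1; this sketch only shows why an uncorrected average feeds the sampler noise with the wrong scale.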