Large-scale language-vision pre-trained models such as CLIP have achieved remarkable text-guided image morphing results when paired with unconditional generative models. However, existing CLIP-guided image morphing methods struggle to morph photorealistic images. Specifically, existing guidance does not describe in detail which regions of the image should be morphed, leading to misguidance. In this paper, we observe that such misguidance can be effectively mitigated by simply using a proper regularization loss. Our approach comprises two key components: 1) a geodesic cosine similarity loss that minimizes the distance between inter-modality features (i.e., image and text) on a projected subspace of CLIP space, and 2) a latent regularization loss that minimizes the distance between intra-modality features (i.e., image and image) on the image manifold. Used as a drop-in replacement for the naïve directional CLIP loss, our method achieves superior morphing results on both images and videos for various benchmarks, including CLIP-inversion.
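To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used form of the directional CLIP loss (one minus the cosine similarity between the image-feature shift and the text-feature shift) and, for comparison, an angular (geodesic) distance between features on the unit sphere. The function names and the use of plain Python lists in place of real CLIP embeddings are illustrative assumptions; the projected subspace and the exact loss forms used in the paper are not specified in the abstract and are not reproduced here.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity, clamped to [-1, 1] to absorb rounding error."""
    a, b = _normalize(a), _normalize(b)
    dot = sum(x * y for x, y in zip(a, b))
    return max(-1.0, min(1.0, dot))


def directional_clip_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Assumed form of the naive directional loss:
    1 - cos(image feature shift, text feature shift)."""
    d_img = [t - s for s, t in zip(img_src, img_tgt)]
    d_txt = [t - s for s, t in zip(txt_src, txt_tgt)]
    return 1.0 - cosine_similarity(d_img, d_txt)


def geodesic_distance(a, b):
    """Angle in radians between two features on the unit sphere.
    Illustrative only; the paper's projection step is omitted."""
    return math.acos(cosine_similarity(a, b))
```

With toy two-dimensional features, `directional_clip_loss([1, 0], [1, 1], [2, 0], [2, 3])` is zero because both shifts point in the same direction, while `geodesic_distance([1, 0], [0, 1])` is a quarter turn. In practice the inputs would be image and text embeddings from a CLIP encoder, and the loss would be computed with a differentiable tensor library.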