We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions. The task is especially challenging when prompts feature imaginary elements and artistic styles, because the semantics and motion of such images are hard to interpret. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies and struggle to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking counterpart. While the artistic image depicts the style and appearance described in the text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion from the semantic information. The predicted motion is then transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion direction using text.
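The motion-transfer step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the segmentation mask and the flow field have already been estimated on the realistic twin and are supplied as plain arrays, and it swaps in nearest-neighbor backward warping along a constant flow field as a simple stand-in for whatever warping the full method uses. The names `warp_frames`, `artistic`, `flow`, and `mask` are hypothetical. Because the twins are pixel-aligned, the mask and flow computed on the realistic image can be applied to the artistic image without any remapping.

```python
import numpy as np


def warp_frames(artistic, flow, mask, n_frames):
    """Animate `artistic` (H, W, 3) inside `mask` (H, W, bool) along `flow` (H, W, 2).

    `flow[..., 0]` is the per-frame x displacement and `flow[..., 1]` the
    y displacement. Both `flow` and `mask` are assumed to come from the
    realistic twin; this function only transfers them to the artistic image.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    frames = []
    for t in range(n_frames):
        # Backward warping: each output pixel samples from where it
        # would have been t steps earlier under a constant flow field.
        src_x = np.rint(xs - t * flow[..., 0]).astype(int)
        src_y = np.rint(ys - t * flow[..., 1]).astype(int)
        # Clamp sampling positions to the image bounds.
        src_x = np.clip(src_x, 0, w - 1)
        src_y = np.clip(src_y, 0, h - 1)
        warped = artistic[src_y, src_x]
        # Pixels outside the mask stay identical to the still image.
        frames.append(np.where(mask[..., None], warped, artistic))
    return frames
```

In this sketch the static regions are copied unchanged from the still image in every frame, which is the property the text-based video methods mentioned above struggle to guarantee. A real cinemagraph would also need interpolated sampling and a seamless loop, both of which are omitted here.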