We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions. The task is especially challenging when prompts feature imaginary elements and artistic styles, because the semantics and motion of such images are hard to interpret. We focus on cinemagraphs of fluid elements, such as flowing rivers and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies and struggle to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking counterpart. While the artistic image depicts the style and appearance described in the text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion from the semantic information. The predicted motion is then transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.
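The motion-transfer step can be illustrated with a small sketch. It is not the authors' implementation: it assumes that a segmentation mask and a static per-pixel flow field have already been obtained from the realistic twin, and it swaps in plain nearest-neighbor pixel warping with a simple forward/backward blend to close the loop. The names `integrate_flow`, `warp`, and `make_cinemagraph` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def integrate_flow(flow, steps):
    """Euler-integrate a static flow field (h, w, 2) into a displacement after `steps` frames."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    disp = np.zeros_like(flow, dtype=np.float64)
    for _ in range(steps):
        # Sample the flow at the currently displaced position (nearest neighbor).
        px = np.clip(np.rint(xs + disp[..., 0]), 0, w - 1).astype(int)
        py = np.clip(np.rint(ys + disp[..., 1]), 0, h - 1).astype(int)
        disp += flow[py, px]
    return disp

def warp(image, disp):
    """Backward warp: output pixel (y, x) takes the value at (y + dy, x + dx)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    px = np.clip(np.rint(xs + disp[..., 0]), 0, w - 1).astype(int)
    py = np.clip(np.rint(ys + disp[..., 1]), 0, h - 1).astype(int)
    return image[py, px]

def make_cinemagraph(artistic, flow, mask, n_frames):
    """Animate `artistic` (h, w, 3) inside `mask` (h, w) using `flow` (h, w, 2).

    `flow` and `mask` are assumed to come from the pixel-aligned realistic twin,
    so they can be applied to the artistic image without any re-alignment.
    """
    flow = flow * mask[..., None]  # no motion outside the fluid region
    frames = []
    for t in range(n_frames):
        # Content advected forward by t frames, and backward by (n_frames - t) frames.
        fwd = warp(artistic, integrate_flow(-flow, t))
        bwd = warp(artistic, integrate_flow(flow, n_frames - t))
        alpha = t / n_frames  # cross-fade so that frame n_frames would equal frame 0
        frame = (1.0 - alpha) * fwd + alpha * bwd
        # Pixels outside the mask are copied unchanged, so they stay static.
        frames.append(np.where(mask[..., None] > 0, frame, artistic))
    return frames

# Toy usage: the top half of a random image drifts one pixel per frame to the right.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.zeros((8, 8))
mask[:4] = 1.0
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0
frames = make_cinemagraph(image, flow, mask, n_frames=4)
```

Because the two images are pixel-aligned, the same mask and flow apply to both. That alignment is what allows the analysis to run on the realistic twin while the rendering uses the artistic one.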