This paper introduces a method for realistic kinetic typography that generates user-preferred, animatable text content. We draw on recent advances in guided video diffusion models to achieve visually pleasing text appearances. We first construct a kinetic typography dataset comprising about 600K videos. The dataset is built from a variety of combinations of 584 templates designed by professional motion graphics designers, and involves changing each letter's position, glyph, and size (e.g., flying, glitch, chromatic aberration, and reflection effects). Next, we propose a video diffusion model for kinetic typography. We identify three requirements for such a model: aesthetic appearance, motion effects, and readable letters. To meet them, we present static and dynamic captions, which serve as spatial and temporal guidance for the video diffusion model, respectively. The static caption describes the overall appearance of the video, such as colors, textures, and glyphs, which represent the shape of each letter. The dynamic caption accounts for the movements of letters and backgrounds. We add one more form of guidance, using zero convolution, to determine which text content should be visible in the video: we apply the zero convolution to the text content and impose the result on the diffusion model. Lastly, we propose a glyph loss, which minimizes only the difference between the predicted word and its ground truth, to make the predicted letters readable. Experiments show that our model generates kinetic typography videos with legible and artistic letter motions based on text prompts.
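The zero-convolution guidance mentioned above can be illustrated with a minimal sketch. It assumes the term is used in its common sense: a 1x1 convolution whose weights and bias are initialized to zero, so that the conditioning branch contributes nothing at the start of training and the pretrained backbone is left undisturbed. The abstract does not specify the architecture, so the function names (`zero_conv_params`, `conv1x1`, `inject_condition`), the tensor shapes, and the additive injection are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# A feature map stored as [channels][pixels], with pixels flattened.
Feature = List[List[float]]


def zero_conv_params(in_ch: int, out_ch: int) -> Tuple[List[List[float]], List[float]]:
    """Return weights and bias of a 1x1 convolution, all initialized to zero."""
    weight = [[0.0] * in_ch for _ in range(out_ch)]
    bias = [0.0] * out_ch
    return weight, bias


def conv1x1(x: Feature, weight: List[List[float]], bias: List[float]) -> Feature:
    """Apply a 1x1 convolution: a per-pixel linear map across channels."""
    n_pixels = len(x[0])
    return [
        [
            bias[o] + sum(weight[o][c] * x[c][p] for c in range(len(x)))
            for p in range(n_pixels)
        ]
        for o in range(len(weight))
    ]


def inject_condition(hidden: Feature, cond: Feature,
                     weight: List[List[float]], bias: List[float]) -> Feature:
    """Add the projected condition to the backbone features (assumed additive)."""
    projected = conv1x1(cond, weight, bias)
    return [
        [h + p for h, p in zip(h_row, p_row)]
        for h_row, p_row in zip(hidden, projected)
    ]


# Hypothetical backbone features: 2 channels, 2 pixels.
hidden = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical text-content features: 3 channels, 2 pixels.
text_feat = [[0.5, -0.5], [1.0, 1.0], [2.0, 0.0]]

# At initialization the zero convolution leaves the backbone output unchanged.
w, b = zero_conv_params(in_ch=3, out_ch=2)
out_init = inject_condition(hidden, text_feat, w, b)

# Once a weight becomes non-zero (set by hand here to mimic a training update),
# the text-content condition starts to influence the features.
w[0][0] = 1.0
out_trained = inject_condition(hidden, text_feat, w, b)
```

In this sketch, `out_init` equals `hidden` exactly, which is the property that makes zero-initialized conditioning safe to attach to a pretrained model; only after the weights move away from zero does the text content affect the output.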