Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. Despite the progress achieved in Text-to-Video~(T2V) generation, one significant unresolved aspect is the effective visualization of text within generated videos: current methods still cannot render text in videos directly and accurately, as they mainly focus on summarizing semantic scene information and on understanding and depicting actions. While recent advances in image-level visual text generation show promise, transitioning these techniques to the video domain poses challenges, notably in preserving textual fidelity and motion coherence. In this paper, we propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. In addition, we develop a camera control module and a text refinement module that improve the stability of the generated visual text by controlling the camera movement as well as the motion of the visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approach over state-of-the-art video generation methods in the accuracy of the generated visual text. The project page can be found at https://laulampaul.github.io/text-animator.html.