Text-to-video generation has shown promising results. However, with natural language as the only input, users often struggle to provide enough detail to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To this end, we propose a unified framework that jointly injects control signals into an existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layers, we adapt the model to generate videos that are aligned with both the text prompt and the fine-grained control signals. Compared to existing methods that rely on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface for object-level fine-grained control. Our method controls object appearance without finetuning, which spares users the effort of per-subject optimization. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.