Generating rich and controllable motion is a pivotal challenge in video synthesis. We propose Boximator, a new approach for fine-grained motion control. Boximator introduces two constraint types: hard boxes and soft boxes. Users select objects in the conditional frame using hard boxes and then use either type of box to roughly or rigorously define each object's position, shape, or motion path in future frames. Boximator functions as a plug-in for existing video diffusion models. Its training process preserves the base model's knowledge by freezing the original weights and training only the control module. To address training challenges, we introduce a novel self-tracking technique that greatly simplifies the learning of box-object correlations. Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models, with further gains after box constraints are incorporated. Its robust motion controllability is validated by drastic increases in the bounding box alignment metric. Human evaluation also shows that users favor Boximator's generation results over those of the base model.
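
As an illustration only, the difference between the two constraint types can be sketched in Python. The sketch assumes that a hard box pins the object's bounding box closely (checked here with an IoU threshold) and that a soft box only bounds the region the object must stay within. The `Box` class, the function names, and the 0.9 threshold are hypothetical and are not taken from Boximator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes are treated as having zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def satisfies_hard(obj: Box, constraint: Box, min_iou: float = 0.9) -> bool:
    """Assumed reading of a hard box: the object box must match it closely."""
    return iou(obj, constraint) >= min_iou


def satisfies_soft(obj: Box, constraint: Box) -> bool:
    """Assumed reading of a soft box: the object must lie inside the region."""
    return (constraint.x1 <= obj.x1 and constraint.y1 <= obj.y1
            and obj.x2 <= constraint.x2 and obj.y2 <= constraint.y2)


if __name__ == "__main__":
    obj = Box(10, 10, 50, 50)
    region = Box(0, 0, 100, 100)
    print(satisfies_soft(obj, region))  # object lies inside the loose region
    print(satisfies_hard(obj, region))  # but does not match it tightly
```

Under these assumptions, a soft box leaves the model free to place the object anywhere inside the region, while a hard box fixes both its position and its shape.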