By generating plausible and smooth transitions between two image frames, video inbetweening is an essential tool for video editing and long video synthesis. Traditional methods lack the capability to generate complex, large motions, while recent video generation techniques, though powerful at producing high-quality results, often lack fine control over the details of intermediate frames, which can lead to results that do not align with the creative intent. We introduce MotionBridge, a unified video inbetweening framework that allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. Learning such multi-modal controls in a unified framework is, however, a challenging task. We therefore design two generators to extract the control signals faithfully and encode features through dual-branch embedders to resolve ambiguities. We further introduce a curriculum training strategy to smoothly learn the various controls. Extensive qualitative and quantitative experiments demonstrate that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.