JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

Recent advances in text-to-video generation have enabled coherent video synthesis from prompts and have extended to fine-grained control over appearance and motion. However, existing methods show one of two failure modes. Some suffer from concept interference, which stems from a feature domain mismatch caused by naive decoupled optimizations. Others exhibit appearance contamination, induced by spatial feature leakage that arises because motion and appearance are entangled in reference video reconstructions. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers within the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference video reconstructions through an appearance-agnostic noise prediction task. The key innovation is to add frame-wise offset noise to the ground-truth Gaussian noise; this perturbs the noise distribution and thereby disrupts the spatial attributes associated with individual frames while preserving temporal coherence. Furthermore, we construct a benchmark comprising 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions, facilitating a more comprehensive evaluation of this customization task. Extensive experiments demonstrate the superior performance of our method compared to current advanced approaches.
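
The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives neither the form of the gate nor the way the offset is sampled. The sketch assumes a scalar sigmoid gate computed from the layer input, and one Gaussian offset per frame shared by every spatial position in that frame. The names `gated_lora_output`, `framewise_offset_noise`, `gate_w`, `gate_b`, and `strength` are hypothetical. Plain Python lists stand in for tensors so that the block stays self-contained.

```python
import math
import random


def gated_lora_output(x, W, A, B, gate_w, gate_b=0.0):
    """Frozen linear layer plus a LoRA update scaled by an input-dependent gate.

    x: input vector (length d_in); W: frozen weight (d_out x d_in);
    A: LoRA down-projection (r x d_in); B: LoRA up-projection (d_out x r);
    gate_w, gate_b: parameters of an assumed scalar sigmoid gate.
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)               # output of the frozen layer
    delta = matvec(B, matvec(A, x))   # low-rank update B(Ax)
    logit = sum(w * xi for w, xi in zip(gate_w, x)) + gate_b
    gate = 1.0 / (1.0 + math.exp(-logit))  # in (0, 1), depends on the input
    return [b + gate * d for b, d in zip(base, delta)], gate


def framewise_offset_noise(num_frames, num_pixels, strength=0.1, rng=None):
    """Gaussian noise for each frame plus one shared random offset per frame.

    Returns (noise, offsets): noise is a num_frames x num_pixels nested list,
    offsets holds the scalar that was added to every value of each frame.
    """
    rng = rng or random.Random()
    noise, offsets = [], []
    for _ in range(num_frames):
        offset = strength * rng.gauss(0.0, 1.0)  # one scalar per frame (assumption)
        offsets.append(offset)
        noise.append([rng.gauss(0.0, 1.0) + offset for _ in range(num_pixels)])
    return noise, offsets


if __name__ == "__main__":
    out, gate = gated_lora_output(
        x=[1.0, 2.0],
        W=[[1.0, 0.0], [0.0, 1.0]],
        A=[[1.0, 1.0]],
        B=[[1.0], [2.0]],
        gate_w=[0.0, 0.0],
    )
    print("gated LoRA output:", out, "gate:", gate)

    noise, offsets = framewise_offset_noise(3, 4, strength=0.5, rng=random.Random(0))
    print("per-frame offsets:", offsets)
```

Under these assumptions, the perturbed noise would presumably replace the plain Gaussian noise as the regression target of the temporal loss, so that each frame's target carries a random shift unrelated to the frame's content. How the paper combines this target with the rest of the training objective is not stated in the abstract.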