Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns the hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and HunyuanVideo, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when models are fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability.
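To make the mechanism concrete, below is a minimal PyTorch sketch of what such a cross-frame alignment term could look like. Everything here is an illustrative assumption following the REPA recipe rather than the paper's exact formulation: the function name `crepa_loss`, the trainable projection head `proj`, the neighbor `window`, and the use of per-patch cosine similarity are all hypothetical.

```python
import torch
import torch.nn.functional as F

def crepa_loss(hidden, ext_feats, proj, window=1):
    """Hypothetical cross-frame alignment regularizer (a sketch, not the paper's exact loss).

    hidden:    (B, T, N, D_h) DiT hidden states, N patch tokens per frame
    ext_feats: (B, T, N, D_e) frozen pretrained features per frame (e.g., DINOv2)
    proj:      small trainable MLP mapping D_h -> D_e, as in REPA
    window:    number of neighboring frames on each side to align with
    """
    T = hidden.shape[1]
    z = proj(hidden)  # project hidden states into the external feature space
    loss, count = hidden.new_zeros(()), 0
    for dt in range(-window, window + 1):
        if dt == 0:
            continue  # cross-frame terms only; dt == 0 would recover plain REPA
        src = slice(max(0, -dt), T - max(0, dt))   # frames that have a neighbor at offset dt
        tgt = slice(max(0, dt), T - max(0, -dt))   # the corresponding neighbor frames
        # negative cosine similarity, per patch token, between a frame's
        # projected hidden states and its neighbor's external features
        sim = F.cosine_similarity(z[:, src], ext_feats[:, tgt].detach(), dim=-1)
        loss = loss - sim.mean()
        count += 1
    return loss / max(count, 1)
```

In practice, a term like this would be added to the usual denoising objective with a small weight (e.g., `loss = diffusion_loss + lam * crepa_loss(...)`), so the alignment acts as a regularizer during LoRA fine-tuning rather than replacing the training loss.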