Recently, open-source video diffusion models (VDMs), such as WanX, Magic141, and HunyuanVideo, have been scaled to over 10 billion parameters. These large-scale VDMs demonstrate significant improvements over smaller-scale VDMs across multiple dimensions, including enhanced visual quality and more natural motion dynamics. However, these models face two major limitations: (1) High inference overhead: large-scale VDMs require approximately 10 minutes to synthesize a video with 28 sampling steps on a single H100 GPU. (2) Limited portrait video synthesis: models like WanX-I2V and HunyuanVideo-I2V often produce unnatural facial expressions and movements in portrait videos. To address these challenges, we propose MagicDistillation, a novel framework designed to reduce inference overhead while preserving the generalization ability of VDMs for portrait video synthesis. Specifically, we first fine-tune Magic141, which is dedicated to portrait video synthesis, on sufficiently high-quality talking-head videos. We then employ LoRA to effectively and efficiently fine-tune the fake DiT within the step distillation framework known as distribution matching distillation (DMD). Following this, we apply weak-to-strong (W2S) distribution matching to minimize the discrepancy between the fake data distribution and the ground-truth distribution, thereby improving the visual fidelity and motion dynamics of the synthesized videos. Experimental results on portrait video synthesis demonstrate the effectiveness of MagicDistillation, as our method surpasses the Euler, LCM, and DMD baselines on both FID/FVD metrics and VBench. Moreover, MagicDistillation, requiring only 4 sampling steps, also outperforms WanX-I2V (14B) and HunyuanVideo-I2V (13B) in qualitative visualizations and on VBench. Our project page is https://magicdistillation.github.io/MagicDistillation/.
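To illustrate the distribution-matching idea behind DMD that the abstract refers to, the following is a minimal, hedged toy sketch (not the paper's implementation): the generator's gradient direction is approximated by the difference between the fake model's score and the real (teacher) model's score at a noised sample. Here both scores are 1-D Gaussian scores chosen purely for illustration; `mu_real`, `mu_fake`, and `sigma` are hypothetical toy parameters, whereas the actual method operates on large video DiTs with LoRA-adapted fake models.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score function (gradient of log density) of N(mu, sigma^2) at x."""
    return -(x - mu) / sigma**2

def dmd_gradient(x_t, mu_real=0.0, mu_fake=1.0, sigma=1.0):
    """Toy DMD-style gradient at a noised sample x_t.

    The loss gradient is score_fake(x_t) - score_real(x_t); descending
    this direction moves generated samples toward the real distribution.
    All distributions here are toy 1-D Gaussians for illustration only.
    """
    s_real = gaussian_score(x_t, mu_real, sigma)
    s_fake = gaussian_score(x_t, mu_fake, sigma)
    return s_fake - s_real

# For two Gaussians with equal sigma, the score difference is constant:
# (mu_fake - mu_real) / sigma^2, so gradient descent shifts every sample
# by the same amount toward the real mean.
x = np.array([0.5, 1.0, 1.5])
print(dmd_gradient(x))  # [1. 1. 1.]
```

In the equal-variance Gaussian case the update is a uniform shift toward the real mean, which makes the mechanism easy to verify by hand; in practice the scores come from learned diffusion models and the difference varies with the sample and noise level.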