Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motion limits the practical applicability of such models. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training for complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
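As context for the loss described above, the standard DPO objective that RealDPO adapts can be sketched as follows, with the real video $x^{w}$ as the preferred sample and the model's own erroneous output $x^{l}$ as the dispreferred one. Note this is the generic DPO form, not the paper's tailored variant, which this abstract does not specify; here $c$ is the text prompt, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the logistic sigmoid, and $\beta$ a temperature hyperparameter.

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(c,\,x^{w},\,x^{l})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_{\theta}(x^{w}\mid c)}{\pi_{\mathrm{ref}}(x^{w}\mid c)} \;-\; \beta \log \frac{\pi_{\theta}(x^{l}\mid c)}{\pi_{\mathrm{ref}}(x^{l}\mid c)}\right)\right]
\]

Minimizing this objective pushes $\pi_{\theta}$ to assign relatively higher likelihood to real videos than to its own flawed samples, which is the mechanism behind the iterative self-correction described above.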