Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports or dance. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve human motion in generated videos. HuDA integrates human detection confidence, which measures appearance quality, with a temporal prompt-alignment score, which captures motion realism. We show that this simple reward function, built from off-the-shelf models without any additional training, outperforms specialized models finetuned on manually annotated data. Using HuDA for Group Relative Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially for complex human motions, outperforming state-of-the-art models such as Wan 2.1 with a win rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond humans, for instance, significantly improving the generation of animal videos and human-object interactions.
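The reward described above, combining a human-detection confidence term with a temporal prompt-alignment term, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-frame scores would come from an off-the-shelf detector and a video-text alignment model, and the mixing weight `w` is a hypothetical parameter, since the exact combination rule is not specified in the abstract.

```python
# Sketch of a HuDA-style reward: mean detection confidence (appearance quality)
# mixed with mean prompt-alignment score (motion realism). The weighted-sum
# combination and the weight w=0.5 are assumptions for illustration only.
from statistics import mean

def huda_reward(det_confidences, alignment_scores, w=0.5):
    """Return a scalar reward for a generated video.

    det_confidences: per-frame human-detector confidences in [0, 1]
    alignment_scores: per-frame video-prompt alignment scores in [0, 1]
    w: hypothetical mixing weight between the two terms
    """
    appearance = mean(det_confidences)   # appearance-quality term
    motion = mean(alignment_scores)      # motion-realism term
    return w * appearance + (1 - w) * motion

# A video whose frames all contain a confidently detected human matching the
# prompt should score higher than one with weak or missing detections.
good = huda_reward([0.90, 0.95, 0.92], [0.80, 0.85, 0.82])
bad = huda_reward([0.30, 0.10, 0.20], [0.50, 0.40, 0.45])
```

In a GRPO post-training loop, such a scalar reward would be computed for each video in a sampled group and the group-normalized rewards would serve as advantages for the policy update.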