Video diffusion alignment has relied heavily on scalar rewards. These rewards are typically derived from reward models learned on human preference datasets, requiring additional model training and extensive data collection. Moreover, scalar rewards provide only coarse, global supervision, offering limited credit assignment for prompt-generation mismatches and leaving models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF employs a frozen, off-the-shelf Vision-Language Model (VLM) as the critic, eliminating the need for reward model training. Instead of relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions paired with free-form dense VQA explanation queries, yielding information-rich feedback. By directly and differentiably optimizing over this rich feedback, Diffusion-DRF achieves stable reward-based tuning without collecting preference datasets. Diffusion-DRF achieves significant gains both quantitatively and qualitatively, outperforming the state-of-the-art Flow-GRPO by 4.74% in overall performance on the unseen VBench-2.0 benchmark.
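Below is a minimal, hypothetical sketch of the mechanism the abstract describes: a frozen VLM critic answers per-prompt VQA questions about a generated video, and the resulting differentiable log-probabilities are averaged into a dense reward that is back-propagated into the generator. All names here (`TinyVideoGenerator`, `FrozenVLMCritic`, `decompose_prompt`) are illustrative placeholders, not the authors' actual API; real systems would use a video diffusion model and an off-the-shelf VLM in place of the toy modules.

```python
# Hedged sketch of differentiable VQA-style reward fine-tuning (assumptions noted above).
import torch
import torch.nn as nn

def decompose_prompt(prompt: str) -> list[str]:
    # Placeholder: the paper derives multi-dimensional questions and free-form
    # explanation queries from the prompt; here we hard-code a few examples.
    return [
        f"Does the video show: {prompt}?",
        "Is the motion temporally consistent?",
        "Are the objects' attributes correct?",
    ]

class TinyVideoGenerator(nn.Module):
    """Stand-in for a trainable video diffusion model."""
    def __init__(self, frames=4, h=8, w=8):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(frames, 3, h, w))
    def forward(self):
        return torch.sigmoid(self.latent)  # toy "decoded" video frames

class FrozenVLMCritic(nn.Module):
    """Stand-in for a frozen, off-the-shelf VLM answering VQA questions.

    Returns a differentiable log-probability that the answer is 'yes'."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(3, dim)
        for p in self.parameters():
            p.requires_grad_(False)  # critic stays frozen; no reward-model training
    def yes_logprob(self, video: torch.Tensor, question: str) -> torch.Tensor:
        feat = self.proj(video.mean(dim=(0, 2, 3)))   # pooled per-channel video feature
        score = feat.mean() + 0.01 * len(question)    # toy question-dependent score
        return torch.nn.functional.logsigmoid(score)  # log p("yes")

generator = TinyVideoGenerator()
critic = FrozenVLMCritic()
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-2)

questions = decompose_prompt("a red car driving through snow")
for step in range(3):
    video = generator()
    # Dense, multi-dimensional reward: average log p("yes") over all questions.
    reward = torch.stack([critic.yes_logprob(video, q) for q in questions]).mean()
    loss = -reward          # maximize reward by direct differentiable optimization
    optimizer.zero_grad()
    loss.backward()         # gradients flow through the frozen critic into the generator
    optimizer.step()
    print(f"step {step}: reward={reward.item():.4f}")
```

The key design point illustrated is that the critic is never updated: only the generator receives gradients, so no preference dataset or reward-model training is required.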