PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Text-to-video (T2V) generation aims to synthesize videos that have high visual quality and temporal consistency and that are semantically aligned with the input text. Reward-based post-training has emerged as a promising way to improve both the quality and the semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations, which limits scalability, or operate on misaligned embeddings from pre-trained vision-language models, which leads to suboptimal supervision. We present $\texttt{PISCES}$, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, $\texttt{PISCES}$ uses OT to bridge text and video embeddings at both the distributional level and the discrete token level, yielding two rewards: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic and spatio-temporal correspondence between text and video tokens. To our knowledge, $\texttt{PISCES}$ is the first method to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that $\texttt{PISCES}$ outperforms annotation-based and annotation-free methods on VBench in both Quality and Semantic scores, and human preference studies further validate its effectiveness. We also show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

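For readers unfamiliar with OT, the following is a minimal, generic sketch of how a token-level transport plan between text and video embeddings can be computed and turned into a scalar score. It is not the PISCES reward itself: the abstract does not specify the solver, the cost function, or the marginals, so the entropic Sinkhorn solver, the cosine cost, the uniform token weights, and the names `sinkhorn_plan` and `ot_semantic_score` are all assumptions made for illustration, and the random arrays stand in for real encoder outputs.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns toward marginal b
        u = a / (K @ v)              # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

def ot_semantic_score(text_tokens, video_tokens, eps=0.1):
    """Hypothetical score: negative transport cost between token sets (assumption)."""
    # L2-normalize so the dot product is cosine similarity.
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    v = video_tokens / np.linalg.norm(video_tokens, axis=1, keepdims=True)
    cost = 1.0 - t @ v.T             # cosine distance, in [0, 2]
    # Uniform mass on every token (assumption; other weightings are possible).
    a = np.full(len(t), 1.0 / len(t))
    b = np.full(len(v), 1.0 / len(v))
    plan = sinkhorn_plan(cost, a, b, eps=eps)
    return -float(np.sum(plan * cost)), plan

# Random stand-ins for text and video token embeddings.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(6, 32))    # 6 text tokens, 32-dim
video_tokens = rng.normal(size=(20, 32))  # 20 video tokens, 32-dim
score, plan = ot_semantic_score(text_tokens, video_tokens)
print(plan.shape, score)
```

Each entry `plan[i, j]` is the mass moved from text token `i` to video token `j`, so the plan acts as a soft correspondence between the two token sets. A lower total transport cost, and hence a higher score, indicates that the token sets are closer under the chosen cost.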