Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report distills a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.
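To make the staged structure concrete, the sketch below outlines one way the three stages could be sequenced behind diagnostic gates, with a KL-style penalty toward a frozen reference policy acting as the stability constraint. Every name in it (Policy, sft_step, reward_step, preference_step, the thresholds and toy preference pairs) is a hypothetical placeholder for exposition, not the report's actual interface, and the scalar "quality" proxy stands in for real fidelity, coherence, and adherence metrics.

```python
"""Minimal illustrative sketch of a staged, stability-constrained
post-training stack: supervised shaping -> reward-driven RL ->
preference refinement. All names and numbers here are assumptions."""

import copy
import random


class Policy:
    """Stand-in for a pretrained video generator's trainable state."""
    def __init__(self):
        self.quality = 0.5  # toy scalar proxy for model quality


def sft_step(policy, batch):
    # Stage 1: supervised policy shaping on curated instruction data.
    policy.quality += 0.01 * random.random()


def reward_step(policy, reference, batch, kl_coeff=0.1):
    # Stage 2: reward-driven RL. The penalty toward the frozen reference
    # policy is the stability constraint: it limits drift away from the
    # controllability established at initialization.
    reward = random.random()                       # stand-in reward signal
    kl = abs(policy.quality - reference.quality)   # stand-in divergence
    policy.quality += 0.01 * (reward - kl_coeff * kl)


def preference_step(policy, pair):
    # Stage 3: preference-based refinement on weakly discriminative
    # pairwise feedback (chosen vs. rejected rollout scores).
    chosen, rejected = pair
    policy.quality += 0.005 * (chosen - rejected)


def diagnostics_pass(policy, threshold):
    # Diagnostic-driven gating: advance to the next stage only when the
    # current stage's held-out checks clear a target level.
    return policy.quality >= threshold


def post_train(policy, data):
    reference = copy.deepcopy(policy)  # frozen anchor for the constraint
    while not diagnostics_pass(policy, 0.6):
        sft_step(policy, data)
    while not diagnostics_pass(policy, 0.7):
        reward_step(policy, reference, data)
    for pair in [(0.9, 0.4), (0.8, 0.5)]:  # toy preference pairs
        preference_step(policy, pair)
    return policy


if __name__ == "__main__":
    print("final quality proxy:", post_train(Policy(), data=None).quality)
```

The ordering matters in this sketch: the reference policy is snapshotted before any reward optimization, so the stability constraint is anchored to the supervised initialization rather than to a moving target.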