Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.