Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, attaining higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although the two are often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of recent application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research on scalable, efficient, and generalizable LLM post-training.
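As a point of reference for the objectives the abstract mentions, the following is a minimal sketch of the two canonical post-training objectives in standard notation; the symbols ($\pi_\theta$ for the model policy, $\mathcal{D}$ for the prompt/demonstration data, $r$ for a reward function, $\pi_{\mathrm{ref}}$ for a reference policy, $\beta$ for the KL coefficient) are conventional choices, not notation taken from this paper.

\begin{align*}
% SFT: maximum likelihood over demonstration pairs (x, y) drawn from D.
\mathcal{L}_{\mathrm{SFT}}(\theta) &= -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\Bigg[\sum_{t=1}^{|y|}\log \pi_\theta\big(y_t \mid x,\, y_{<t}\big)\Bigg],\\
% RL (KL-regularized, as in RLHF): maximize reward while staying close to the reference policy.
\mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big].
\end{align*}

One standard way the literature connects the two, which the abstract's claim of a close relationship echoes: maximizing $\mathcal{J}_{\mathrm{RL}}$ with samples drawn off-policy from $\mathcal{D}$ and a reward that simply favors demonstration tokens recovers a gradient of the same form as $\mathcal{L}_{\mathrm{SFT}}$. This paper's own formalization may differ.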