Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term, governed by a selective masking mechanism, into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with negligible computational overhead compared with the CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED-SFT
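The abstract describes adding a selective entropy regularization term, gated by a masking mechanism, to the SFT objective. The exact masking rule is not specified here, so the following is a minimal numpy sketch under an illustrative assumption: tokens whose predictive entropy exceeds a threshold `tau` (a stand-in for "large token exploration space") receive an entropy bonus, while low-entropy tokens are trained with plain CE. The function name `sed_sft_loss` and the hyperparameters `lam` and `tau` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sed_sft_loss(logits, targets, lam=0.1, tau=0.5):
    """Hypothetical sketch of an SFT loss with selective entropy regularization.

    logits:  (T, V) per-token logits from the model.
    targets: (T,)   gold next-token ids.
    lam:     weight of the entropy bonus (assumed hyperparameter).
    tau:     entropy threshold for the selective mask (assumed criterion;
             the paper's actual masking mechanism may differ).
    """
    probs = softmax(logits)                               # (T, V)
    idx = np.arange(len(targets))
    ce = -np.log(probs[idx, targets] + 1e-12)             # per-token cross-entropy
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)   # per-token entropy
    mask = (ent > tau).astype(float)                      # select "explorable" tokens only
    # Subtracting masked entropy rewards diversity on selected tokens
    # while leaving confident (low-entropy) tokens to plain CE.
    reg = (mask * ent).sum() / max(mask.sum(), 1.0)
    return ce.mean() - lam * reg
```

With `lam=0` this reduces to the standard CE objective; increasing `lam` lowers the loss on high-entropy tokens, so the gradient no longer pushes their distributions toward a single mode.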