Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that practice. We hypothesize that safety- and reasoning-oriented prefix sentences act as lightweight alignment signals that steer model decoding toward safer and more coherent responses. To test this, we fine-tune three R1-series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying the fraction of training examples that include a prefix from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and a +7% improvement on GSM8K reasoning. Factuality and coding tasks, however, show marginal or negative effects, indicating that the prefix-induced narrowing of the decoding search space benefits structured reasoning specifically. Token-level loss analysis further reveals that prefix tokens such as "revised" and "logically" incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods.
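To make the experimental setup concrete, the prefix-inclusion sweep described above can be sketched as a simple dataset-construction step. This is a minimal illustration only: the prefix templates, the function name `build_sft_examples`, and the `prefix_rate` parameter are hypothetical stand-ins, not the paper's actual pipeline.

```python
import random

# Hypothetical alignment prefixes; the paper's real templates include
# safety- and reasoning-oriented sentences (e.g. containing tokens such
# as "revised" and "logically"), but these exact strings are illustrative.
ALIGNMENT_PREFIXES = [
    "Here is a revised, carefully considered answer.",
    "Let me reason through this logically and safely.",
]

def build_sft_examples(pairs, prefix_rate, seed=0):
    """Prepend an alignment prefix to a `prefix_rate` fraction of targets.

    pairs: list of (prompt, response) tuples
    prefix_rate: float in [0, 1]; the sweep in the paper varies this
                 from 0.0 (no prefixes) to 1.0 (all examples prefixed)
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    examples = []
    for prompt, response in pairs:
        if rng.random() < prefix_rate:
            prefix = rng.choice(ALIGNMENT_PREFIXES)
            response = f"{prefix} {response}"
        examples.append({"prompt": prompt, "response": response})
    return examples
```

Sweeping `prefix_rate` over {0.0, 0.25, 0.5, 0.75, 1.0} and fine-tuning on each resulting dataset would reproduce the kind of controlled comparison the abstract describes.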