Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). The framework builds on prior methods such as BSAFE, centering on a Reinforcement Learning (RL) stage in which models learn to dynamically correct their own generation errors. Through RL with critic feedback on the model's live outputs, LLMs are trained to detect and recover from their actual, emergent safety violations by emitting an efficient "backtrack by x tokens" signal and then continuing generation autoregressively. This RL stage is crucial for instilling resilience against sophisticated adversarial strategies, including middle filling, Greedy Coordinate Gradient (GCG) attacks, and decoding-parameter manipulation. To bootstrap the backtracking capability, we also propose an enhanced Supervised Fine-Tuning (SFT) data-generation strategy (BSAFE+), which improves on previous data-creation techniques by injecting violations into coherent, originally safe text, providing more effective initial training for the backtracking mechanism. Comprehensive empirical evaluations demonstrate that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety outcomes while preserving foundational model utility.
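The "backtrack by x tokens" recovery described above can be illustrated with a minimal decoding-loop sketch. This is an assumption-laden toy, not the paper's implementation: the control-token format `<BT:x>` and the scripted stand-in model are hypothetical, and a real system would operate over an LLM's token IDs and logits rather than strings.

```python
import re

def decode_with_backtracking(next_token_fn, max_tokens=32):
    """Greedy decoding loop honoring hypothetical '<BT:x>' backtrack tokens.

    When the model emits '<BT:x>', the last x output tokens are discarded
    and generation continues autoregressively from the shortened context.
    """
    output = []
    for _ in range(max_tokens):
        tok = next_token_fn(output)
        if tok is None:  # end of generation
            break
        m = re.fullmatch(r"<BT:(\d+)>", tok)
        if m:
            # Backtrack: drop the last x tokens instead of appending.
            x = min(int(m.group(1)), len(output))
            del output[len(output) - x:]
        else:
            output.append(tok)
    return output

# Toy "model": a scripted token stream standing in for an LLM policy.
script = ["How", "to", "bake", "a", "bomb", "<BT:1>", "cake", "."]
def scripted_model(context):
    i = getattr(scripted_model, "i", 0)
    scripted_model.i = i + 1
    return script[i] if i < len(script) else None

print(decode_with_backtracking(scripted_model))
# → ['How', 'to', 'bake', 'a', 'cake', '.']
```

Here the unsafe token "bomb" is retracted by the backtrack signal before generation resumes, mirroring the recovery behavior the RL stage is meant to reinforce.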