Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this, we propose Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO improves consistency under distribution shift across multiple reasoning benchmarks and tasks, with gains of $4.4$ percentage points on MATH-500 and $2.7$ percentage points on LiveCodeBench over standard RTO.
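To make the idea of emphasizing difficult response segments concrete, the following is a minimal sketch of one possible instance, not the authors' implementation. It assumes the KL divergence as the f-divergence, a uniform reference distribution over spans, and the penalized (Lagrangian) form of the robust objective with a fixed temperature; under those assumptions the worst-case span weights are proportional to `exp(loss / temperature)`. The names `kl_dro_weights`, `robust_span_loss`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def kl_dro_weights(span_losses, temperature=1.0):
    """Worst-case span weights for a KL-penalized robust objective.

    Assumption (not from the abstract): KL divergence, uniform reference
    distribution, fixed temperature. The maximizer of
        sum_i q_i * loss_i - temperature * KL(q || uniform)
    is q_i proportional to exp(loss_i / temperature).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    max_loss = max(span_losses)
    exps = [math.exp((loss - max_loss) / temperature) for loss in span_losses]
    total = sum(exps)
    return [e / total for e in exps]


def robust_span_loss(span_losses, temperature=1.0):
    """Reweighted loss that emphasizes high-loss (difficult) spans."""
    weights = kl_dro_weights(span_losses, temperature)
    return sum(w * loss for w, loss in zip(weights, span_losses))


# Hypothetical per-span actor losses for a single response.
losses = [0.2, 0.5, 2.0]
weights = kl_dro_weights(losses, temperature=1.0)
print("span weights:", weights)
print("mean loss:", sum(losses) / len(losses))
print("robust loss:", robust_span_loss(losses, temperature=1.0))
```

In this sketch, a lower temperature concentrates weight on the hardest span, while a high temperature recovers the plain average. Other f-divergences, or a hard constraint on the ambiguity-set radius, would give different weighting rules.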