Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and can even trigger catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shift. To resolve this, we introduce \textbf{PACE} (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating on a minimal sampling budget ($2 < N < 3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 ($N = 16$) while using only about $1/5$ of the compute, and demonstrates superior robustness to reward hacking and label noise.
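For reference, and assuming the standard DPO formulation (Rafailov et al.) rather than any PACE-specific modification, each mined or synthesized preference pair $(x, y_w, y_l)$, with $y_w$ the chosen and $y_l$ the rejected trajectory, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the KL-regularization strength, is scored by
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].
\]
Under this view, DPO-R1 and PACE differ only in how the pairs $(y_w, y_l)$ are obtained: Best-of-N mining versus corrective generation from failed explorations.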