We study \emph{Answer-Level Fine-Tuning} (ALFT), where the goal is to optimize a language model based on the correctness or other properties of its final answers, rather than on the specific reasoning traces used to produce them. Directly optimizing answer-level objectives is computationally intractable because it requires marginalizing over the vast space of latent reasoning paths. To overcome this, we propose a general game-theoretic framework that lifts the problem to a \emph{Distributional Alignment Game}. We formulate ALFT as a two-player game between a Policy (the generator) and a Target (an auxiliary distribution), and we prove that the Nash equilibrium of this game corresponds exactly to the solution of the original answer-level optimization problem. This variational perspective transforms the intractable marginalization problem into a tractable projection problem. We show that the framework unifies recent approaches to diversity and self-improvement (coherence), and we provide efficient algorithms compatible with Group Relative Policy Optimization (GRPO), such as Coherence-GRPO, yielding significant complexity gains in mathematical reasoning tasks.
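
The marginalization difficulty and its variational relaxation can be made concrete with a minimal sketch. The notation here is an assumption introduced for illustration and is not taken from the paper: $x$ is a prompt, $z$ a latent reasoning trace, $a$ a final answer, $\pi_\theta$ the policy, and $q$ an auxiliary distribution over traces. The answer-level likelihood sums over every trace that ends in $a$:
\begin{equation*}
  p_\theta(a \mid x) \;=\; \sum_{z} \pi_\theta(z, a \mid x).
\end{equation*}
The standard variational identity replaces this sum with an optimization over $q$:
\begin{equation*}
  \log p_\theta(a \mid x) \;=\; \max_{q} \Big\{ \mathbb{E}_{z \sim q}\big[\log \pi_\theta(z, a \mid x)\big] + H(q) \Big\},
  \qquad H(q) = -\,\mathbb{E}_{z \sim q}\big[\log q(z)\big],
\end{equation*}
with the maximum attained at $q(z) = \pi_\theta(z \mid a, x)$. This is only the textbook identity, shown to indicate how an intractable sum can be traded for an optimization over an auxiliary distribution; the two-player game between Policy and Target described above may be constructed differently.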