Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, this approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens, as when defining a variable or composing an equation. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy-gradient baselines, highlighting the limitations of token-level policy gradients for complex reasoning and motivating future research beyond token-level granularity for reasoning-intensive language tasks.
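The abstract does not specify MPO's objective in detail, but the core idea of treating K consecutive tokens as one action can be sketched as follows: sum the per-token log-probabilities within each K-token block so that each block contributes a single log-probability term to a REINFORCE-style surrogate loss. This is a minimal illustrative sketch under that assumption, not the paper's actual implementation; the function name and the per-block advantage signal are hypothetical.

```python
import numpy as np

def block_policy_gradient_loss(token_logps, block_advantages, K):
    """REINFORCE-style surrogate loss over blocks of K tokens (illustrative).

    token_logps:      shape (T,), log-probs of the sampled tokens.
    block_advantages: shape (T // K,), one advantage per K-token block.

    A block's log-probability is the sum of its K token log-probs, so each
    block of tokens is treated as a single action in the policy gradient.
    """
    T = len(token_logps)
    assert T % K == 0, "sequence length must be a multiple of K in this sketch"
    block_logps = token_logps.reshape(T // K, K).sum(axis=1)
    # Negative sign: minimizing this loss maximizes advantage-weighted log-prob.
    return -(block_logps * block_advantages).sum()

# Toy example: 6 sampled tokens, K=3, so two block-level actions.
logps = np.log(np.array([0.5, 0.5, 0.5, 0.25, 0.25, 0.25]))
advantages = np.array([1.0, -1.0])
loss = block_policy_gradient_loss(logps, advantages, K=3)
```

With K=1 this reduces to the standard token-level policy gradient, which is one way to see the block-level formulation as a strict generalization.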