Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation tasks, EBFT matches reinforcement learning with verifiable rewards (RLVR) and outperforms supervised fine-tuning (SFT) on downstream accuracy, while achieving lower validation cross-entropy than both methods.
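The training loop the abstract describes, rollouts drawn from nested strided prefixes, batched feature extraction over the completions, and an on-policy policy-gradient update against a feature-matching reward, can be illustrated with a toy sketch. The context-free softmax policy, the histogram feature map, and every name below are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch (not the paper's implementation) of the abstract's loop:
# nested strided rollouts, batched feature extraction, and a
# REINFORCE-style update toward a feature-matching reward.
import math
import random

random.seed(0)
VOCAB = 4          # toy vocabulary {0, 1, 2, 3}
ROLLOUT_LEN = 6    # completion tokens sampled per rollout
STRIDE = 2         # spacing between nested prefix cut points

logits = [0.0] * VOCAB  # context-free "policy" parameters (assumption)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def sample_token():
    r, c = random.random(), 0.0
    for tok, pi in enumerate(softmax(logits)):
        c += pi
        if r < c:
            return tok
    return VOCAB - 1

def nested_rollouts(prompt):
    """One completion per nested prefix prompt[:k], k = 0, STRIDE, ...
    (a stand-in for strided block-parallel sampling)."""
    return [[sample_token() for _ in range(ROLLOUT_LEN)]
            for _k in range(0, len(prompt) + 1, STRIDE)]

def phi(completion):
    """Feature map stand-in: normalized token histogram of the completion."""
    h = [0.0] * VOCAB
    for t in completion:
        h[t] += 1.0
    return [x / len(completion) for x in h]

def reward(completion, target):
    # Feature matching: negative squared distance to target statistics.
    return -sum((a - b) ** 2 for a, b in zip(phi(completion), target))

def policy_gradient_step(prompt, target, lr=1.0):
    """On-policy REINFORCE update with a mean-reward baseline."""
    batch = nested_rollouts(prompt)
    rs = [reward(c, target) for c in batch]
    baseline = sum(rs) / len(rs)
    p = softmax(logits)
    grad = [0.0] * VOCAB
    for completion, r in zip(batch, rs):
        adv = r - baseline
        for tok in completion:
            for j in range(VOCAB):
                # d/d logit_j of log softmax(tok) = 1[j == tok] - p[j]
                grad[j] += adv * ((1.0 if j == tok else 0.0) - p[j])
    for j in range(VOCAB):
        logits[j] += lr * grad[j] / len(batch)
    return batch, rs

target = [0.7, 0.1, 0.1, 0.1]   # desired completion-level statistics
batch, rs = policy_gradient_step(prompt=[0, 1, 2, 3], target=target)
print(len(batch), max(rs) <= 0.0)  # prints "3 True"
```

A real implementation would replace the histogram features with embeddings from a feature extractor and run the nested rollouts as one batched forward pass; the sketch only preserves the structure of the update.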