Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by using a cross-model aggregated response as an internal training signal. Given a question, the models generate responses sequentially; the final aggregated answer, which in practice is often more reliable than any individual response, serves as the internal learning target. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates: responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent. It requires no external supervision or teacher-student hierarchy and relies solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective training signal for self-supervised fine-tuning.
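The PMI-based scaling described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting function, so everything here is an assumption made for illustration: the function names `pmi` and `update_weight`, the logistic form of the weight, the `temperature` parameter, and the toy probabilities. Only the standard definition of pointwise mutual information, PMI(r; a) = log p(a | q, r) - log p(a | q), and the qualitative behavior (aligned responses updated less, misaligned ones more) are taken as given.

```python
import math


def pmi(logp_agg_given_response: float, logp_agg: float) -> float:
    """Pointwise mutual information (in nats) between a response r and the
    aggregate a, given log p(a | q, r) and log p(a | q)."""
    return logp_agg_given_response - logp_agg


def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Hypothetical scaling rule (an assumption, not the paper's formula):
    a logistic function of -PMI, so the weight lies in (0, 1) and shrinks
    as the response becomes more informative about the aggregate."""
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


# Toy numbers: the aggregate has prior probability 0.2 given the question alone.
logp_agg = math.log(0.2)

# A response that makes the aggregate more likely (0.8) has positive PMI.
pmi_aligned = pmi(math.log(0.8), logp_agg)

# A response that makes the aggregate less likely (0.05) has negative PMI.
pmi_misaligned = pmi(math.log(0.05), logp_agg)

w_aligned = update_weight(pmi_aligned)        # about 0.2: small update
w_misaligned = update_weight(pmi_misaligned)  # about 0.8: large update
```

In an actual training loop, a weight of this kind would multiply each response's self-training loss term; the log-probabilities would come from the models themselves rather than from fixed toy values.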