Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models. Discriminators interact in a peer evaluation setting, where utilities are computed using a determinant-based mutual information score that provably incentivizes truthful reporting without requiring ground-truth labels. We establish theoretical guarantees showing that each agent, via online learning, achieves sublinear regret, in the sense that its cumulative performance approaches that of the best fixed truthful strategy in hindsight. Moreover, we prove last-iterate convergence to a truthful Nash equilibrium, ensuring that the actual policies used by the agents converge to stable, truthful behavior over time. Empirical evaluations across multiple benchmarks demonstrate significant improvements in factual accuracy. These results position PEG as a practical approach for eliciting truthful behavior from LLMs without supervision or fine-tuning.
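To make the determinant-based mutual information score concrete, the following is a minimal sketch of a DMI-style peer-prediction payoff between two discriminators, assuming binary verdicts and a split of the shared tasks into two halves; the function names (`joint_count_matrix`, `dmi_payoff`) and the exact scoring details are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def joint_count_matrix(reports_a, reports_b, num_labels=2):
    """Empirical joint count matrix of two discriminators' reports on the same tasks."""
    m = np.zeros((num_labels, num_labels))
    for a, b in zip(reports_a, reports_b):
        m[a, b] += 1
    return m

def dmi_payoff(reports_a, reports_b, num_labels=2):
    """Determinant-based mutual information score for one discriminator paired with a peer.

    Tasks are split into two disjoint halves; the payoff is the product of the
    determinants of the two joint count matrices. Under standard peer-prediction
    assumptions, truthful reporting maximizes this payoff in expectation, with
    no ground-truth labels required.
    """
    half = len(reports_a) // 2
    m1 = joint_count_matrix(reports_a[:half], reports_b[:half], num_labels)
    m2 = joint_count_matrix(reports_a[half:], reports_b[half:], num_labels)
    return np.linalg.det(m1) * np.linalg.det(m2)

# Example: two discriminators grading eight generator answers (1 = accept, 0 = reject).
a = [1, 0, 1, 1, 0, 1, 0, 0]
b = [1, 0, 1, 0, 0, 1, 0, 1]
print(dmi_payoff(a, b))
```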