While large language models (LLMs) have emerged as powerful decision-makers across a wide range of single-agent and stationary environments, less effort has been devoted to settings where LLMs must engage in \emph{repeated} and \emph{strategic} interactions with unknown or dynamic opponents. In such settings, recipes built upon \emph{offline} pre-training or fine-tuning, though robust against worst-case adversaries, do not fully exploit the capability of LLMs to adapt \emph{online} based on interaction feedback. Instead, we explore the more natural perspective of scaling inference-time computation as a mechanism for adaptation, embedding the principles of a classical game-theoretic learning dynamic, \emph{smooth Fictitious Play (sFP)}, into LLM inference: (i) for belief formation, we employ an auxiliary opponent model that learns in context to imitate the time-averaged behavior of the opponent; (ii) for best response, we extend best-of-$N$ (BoN) sampling by simulating candidate responses against the opponent model. Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that our method improves performance significantly over the course of repeated online interaction relative to various baselines, offering a scalable and principled approach to repeated strategic decision-making without any parameter updates.
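The two sFP ingredients named above, belief formation from time-averaged opponent behavior and a best response chosen by simulation, can be illustrated on a toy matrix game. The sketch below is only an illustration under stated assumptions and is not the method evaluated in this work: rock-paper-scissors stands in for the negotiation games, an add-one-smoothed empirical frequency stands in for the in-context opponent model, and a fixed list of discrete actions stands in for the $N$ sampled LLM responses. All names (`empirical_belief`, `smooth_best_response`, `best_of_n`, `PAYOFF`) are hypothetical.

```python
import math
import random
from collections import Counter

# Toy zero-sum game used purely for illustration (an assumption, not from the paper).
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
# PAYOFF[(mine, theirs)] is +1 for a win, -1 for a loss, 0 for a tie.
PAYOFF = {
    (m, o): 1 if BEATS[m] == o else (-1 if BEATS[o] == m else 0)
    for m in ACTIONS
    for o in ACTIONS
}


def empirical_belief(opponent_history, actions):
    """Belief formation: time-averaged opponent behavior with add-one smoothing."""
    counts = Counter(opponent_history)
    total = len(opponent_history) + len(actions)
    return {a: (counts[a] + 1) / total for a in actions}


def smooth_best_response(belief, payoff, my_actions, temperature=0.5):
    """Smooth best response: softmax over expected payoffs under the belief."""
    expected = {
        m: sum(p * payoff[(m, o)] for o, p in belief.items()) for m in my_actions
    }
    top = max(expected.values())  # subtract the max for numerical stability
    weights = {m: math.exp((v - top) / temperature) for m, v in expected.items()}
    z = sum(weights.values())
    return {m: w / z for m, w in weights.items()}


def best_of_n(candidates, belief, payoff, rollouts=500, rng=None):
    """Best-of-N by simulation: score each candidate against sampled opponent moves."""
    rng = rng or random.Random(0)
    opp_actions = list(belief)
    opp_weights = [belief[o] for o in opp_actions]

    def score(candidate):
        draws = rng.choices(opp_actions, weights=opp_weights, k=rollouts)
        return sum(payoff[(candidate, o)] for o in draws) / rollouts

    return max(candidates, key=score)


if __name__ == "__main__":
    history = ["rock"] * 8 + ["paper", "scissors"]  # opponent mostly plays rock
    belief = empirical_belief(history, ACTIONS)
    print(smooth_best_response(belief, PAYOFF, ACTIONS))
    print(best_of_n(ACTIONS, belief, PAYOFF))
```

In this toy setting the belief is updated after every round by appending the opponent's latest move to `history`, so the response adapts online without any parameter update. The `temperature` parameter controls how sharply the smooth response concentrates on the highest expected payoff.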