Large language models (LLMs) have been increasingly employed for (interactive) decision-making, through the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting where they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of \emph{regret}. We first empirically study the \emph{no-regret} behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of the human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote no-regret behaviors, we propose a novel \emph{unsupervised} training loss, the \emph{regret-loss}, which, in contrast to the supervised pre-training loss, does not require labels of (optimal) actions. We then establish a statistical guarantee, in the form of a generalization bound, for regret-loss minimization, followed by an optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above ``regrettable'' cases.
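For concreteness, the regret notion underlying this metric is the canonical \emph{external regret} from online learning; the following is a minimal statement, assuming a finite action set $\mathcal{A}$ and (possibly adversarially chosen) loss functions $\ell_t$, with symbols that are generic rather than tied to any particular section's notation:
\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a),
\]
where $a_t$ denotes the action chosen at round $t$. An algorithm is \emph{no-regret} if $\mathrm{Regret}_T$ grows sublinearly in $T$, i.e., $\mathrm{Regret}_T / T \to 0$ as $T \to \infty$.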