Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can be used effectively to improve language models. We empirically investigate this question in a negotiation game known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. For each of these objectives, we finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND, and evaluate them both in self-play and in collaboration with humans. We find that language models improve substantially through self-play, achieving 14-17x higher task reward after finetuning. Further, the trained models generalize to both cooperation and competition with humans, scoring 2.5-6x higher than base models. We view these results as an early promising sign for language model self-play in cooperative settings, despite the lack of theoretical guarantees.
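To make the training loop concrete, below is a minimal Python sketch of one round of filtered behavior cloning in self-play, under stated assumptions: the names `play_episode`, `finetune`, `Episode`, and the lambda interpolation between cooperative and competitive objectives are illustrative placeholders, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    dialogue: List[str]       # full message transcript of one self-play game
    points: Tuple[int, int]   # (agent 0 points, agent 1 points) from the final deal


def objective(points: Tuple[int, int], lam: float) -> float:
    """Interpolated DoND-style objective for agent 0.

    lam =  1.0 -> fully cooperative (maximize joint points)
    lam =  0.0 -> purely self-interested
    lam = -1.0 -> strictly competitive (maximize point difference)
    This parameterization is an illustrative assumption.
    """
    own, other = points
    return own + lam * other


def filtered_bc_round(
    play_episode: Callable[[], Episode],          # hypothetical: runs one self-play game
    finetune: Callable[[List[List[str]]], None],  # hypothetical: behavior-clones on transcripts
    lam: float,
    n_games: int = 1000,
    keep_frac: float = 0.1,
) -> None:
    """One round of filtered behavior cloning: sample, filter by score, imitate."""
    episodes = [play_episode() for _ in range(n_games)]
    episodes.sort(key=lambda ep: objective(ep.points, lam), reverse=True)
    top = episodes[: max(1, int(keep_frac * n_games))]  # keep only the highest-scoring games
    finetune([ep.dialogue for ep in top])               # supervised finetuning on filtered data
```

In this framing, repeating `filtered_bc_round` for several iterations yields the multi-round self-play finetuning described above, with `lam` selecting where the game falls on the cooperative-to-competitive spectrum.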