Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives. Contrary to expectations, we find that language model self-play leads to significant performance gains in both cooperation and competition with humans, suggesting that self-play and related techniques have promise despite a lack of theoretical guarantees.
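The training loop described above, self-play with multiple rounds of filtered behavior cloning under a tunable cooperative/competitive objective, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `play_episode`, `filtered_bc_round`, and the placeholder scoring are hypothetical stand-ins for real LM rollouts and supervised finetuning, and the single `cooperation_weight` is one simple way to interpolate between fully cooperative and strictly competitive objectives.

```python
import random

def play_episode(policy, cooperation_weight):
    """Simulate one DoND-style negotiation between two copies of the policy.

    Placeholder dynamics: a real episode would be an LM dialogue ending in a
    proposed deal, with scores from the agreed item split. Returns the
    transcript and each agent's reward under the blended objective.
    """
    transcript = [policy(f"turn {t}") for t in range(4)]
    score_a, score_b = random.uniform(0, 10), random.uniform(0, 10)
    # cooperation_weight = 1 rewards the partner's score equally (fully
    # cooperative); -1 penalizes it (strictly competitive); 0 is in between.
    reward_a = score_a + cooperation_weight * score_b
    reward_b = score_b + cooperation_weight * score_a
    return transcript, reward_a, reward_b

def filtered_bc_round(policy, finetune, n_episodes=100,
                      cooperation_weight=1.0, keep_frac=0.25):
    """One round of filtered behavior cloning: roll out self-play episodes,
    keep only the highest-reward transcripts, and finetune on them."""
    episodes = []
    for _ in range(n_episodes):
        transcript, r_a, r_b = play_episode(policy, cooperation_weight)
        episodes.append((min(r_a, r_b), transcript))  # score by the worse side
    episodes.sort(key=lambda e: e[0], reverse=True)
    kept = [t for _, t in episodes[:int(n_episodes * keep_frac)]]
    return finetune(policy, kept)  # next round uses the finetuned policy

# Repeating filtered_bc_round for several rounds, with the output policy of
# one round feeding the next, gives the multi-round self-play procedure.
```

Varying `cooperation_weight` per run is what lets the same loop target the cooperative, semi-competitive, and strictly competitive variants of the game.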