We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >97% win rate when KataGo uses enough search to be superhuman. We train our adversaries with a modified KataGo implementation, using less than 14% of the compute used to train the original KataGo. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is interpretable to the extent that human experts can successfully implement it, without algorithmic assistance, to consistently beat superhuman AIs. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.