This paper uses chess, a landmark planning problem in AI, to assess transformers' performance on a planning task where memorization is futile, even at a large scale. To this end, we release ChessBench, a large-scale benchmark dataset of 10 million chess games with legal move and value annotations (15 billion data points) provided by Stockfish 16, the state-of-the-art chess engine. We train transformers with up to 270 million parameters on ChessBench via supervised learning and perform extensive ablations to assess the impact of dataset size, model size, architecture type, and different prediction targets (state-values, action-values, and behavioral cloning). Our largest models learn to predict action-values for novel boards quite accurately, implying highly non-trivial generalization. Despite performing no explicit search, our resulting chess policy solves challenging chess puzzles and achieves a surprisingly strong Lichess blitz Elo of 2895 against humans (grandmaster level). We also compare to Leela Chess Zero and AlphaZero (trained without supervision via self-play) with and without search. We show that, although a remarkably good approximation of Stockfish's search-based algorithm can be distilled into large-scale transformers via supervised learning, perfect distillation is still beyond reach, thus making ChessBench well-suited for future research.
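The abstract does not spell out how the three supervised prediction targets are cast as losses. The sketch below is a rough illustration only, assuming the common choice of treating each target as a classification problem over Stockfish annotations (e.g., discretizing scalar evaluations into bins); the function names, the `model` interface, and the binning choice are hypothetical and not taken from the ChessBench codebase.

```python
import torch.nn.functional as F

# Hypothetical sketch of the three supervised prediction targets over
# Stockfish-annotated positions. `model` maps a tokenized input to
# logits; all names here are illustrative, not the paper's API.

def behavioral_cloning_loss(model, board_tokens, stockfish_best_move):
    # Predict Stockfish's chosen move as classification over the
    # move vocabulary.
    move_logits = model(board_tokens)          # (batch, num_moves)
    return F.cross_entropy(move_logits, stockfish_best_move)

def state_value_loss(model, board_tokens, stockfish_value_bin):
    # Predict a discretized Stockfish evaluation of the position.
    # (Binning the scalar value into classes is one plausible choice;
    # regressing the raw value would be an alternative.)
    value_logits = model(board_tokens)         # (batch, num_bins)
    return F.cross_entropy(value_logits, stockfish_value_bin)

def action_value_loss(model, board_and_move_tokens, stockfish_q_bin):
    # Predict a discretized Stockfish evaluation of a (board, move)
    # pair, i.e., the value of the position reached by that move.
    q_logits = model(board_and_move_tokens)    # (batch, num_bins)
    return F.cross_entropy(q_logits, stockfish_q_bin)
```

Under this reading, the action-value variant yields a search-free policy at play time: score every legal move in the current position and play the one with the highest predicted value, consistent with the abstract's statement that the policy performs no explicit search.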