AI has brought chess systems to a superhuman level, yet these systems rely heavily on black-box algorithms. This is unsustainable for ensuring transparency to the end user, particularly when such systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. To address this, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focus on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.
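To make the idea concrete, here is a minimal sketch of what a contrastive sparse autoencoder over paired trajectories might look like. This is an illustration under assumptions, not the paper's exact formulation: it assumes the contrastive input is the difference between the hidden states of two paired positions, encoded into an overcomplete dictionary with a ReLU encoder and an L1 sparsity penalty. All dimensions and names (`d`, `m`, `W_enc`, `W_dec`, `lam`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden-state dimension d, dictionary size m (overcomplete).
d, m = 16, 64

# Stand-ins for the chess agent's hidden states at two paired trajectory steps.
h_a = rng.normal(size=d)
h_b = rng.normal(size=d)

# One plausible contrastive input: the difference between the paired activations,
# so the sparse features capture what distinguishes the two trajectories.
x = h_a - h_b

# Sparse autoencoder parameters (randomly initialised here, trained in practice).
W_enc = rng.normal(scale=0.1, size=(m, d))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(d, m))

# ReLU encoder yields non-negative, interpretable feature activations.
f = np.maximum(W_enc @ x + b_enc, 0.0)

# Linear decoder reconstructs the contrastive input from the sparse code.
x_hat = W_dec @ f

# Training would minimise reconstruction error plus an L1 sparsity penalty.
lam = 1e-3
loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))
```

In this reading, each dictionary row of `W_dec` is a candidate concept direction, and the non-zero entries of `f` indicate which concepts separate the two trajectories in a pair.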