Language models have shown unprecedented capabilities, sparking debate over the source of their performance. Is it merely the outcome of learning syntactic patterns and surface-level statistics, or do they extract semantics and a world model from the text? Prior work by Li et al. investigated this question by training a GPT model on synthetic, randomly generated Othello games and found that the model learned an internal representation of the board state. We extend this work to the more complex domain of chess, training on real games and investigating our model's internal representations using linear probes and contrastive activations. The model is given no a priori knowledge of the game and is trained solely on next-character prediction, yet we find evidence of internal representations of board state. We validate these representations by using them to intervene on the model's activations and edit its internal board state. Unlike Li et al.'s prior synthetic-dataset approach, our analysis finds that the model also learns to estimate latent variables such as player skill to better predict the next character. We derive a player-skill vector and add it to the model's activations, improving the model's win rate by up to 2.6 times.
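The linear-probing methodology mentioned in the abstract can be sketched minimally. This is a hedged illustration, not the paper's implementation: the activations here are synthetic stand-ins with a planted linear signal, whereas in the real experiments each row would be a residual-stream activation from the trained chess GPT at a given move, and `square_state` the true contents of one board square (0 = empty, 1 = white piece, 2 = black piece). The dimensions and encoding are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-in data: n examples of d-dimensional "activations"
# and the state of a single board square for each example.
rng = np.random.default_rng(0)
n, d = 600, 64
square_state = rng.integers(0, 3, size=n)
# Plant a linear signal so the synthetic probe has something to recover;
# a real probe would instead look for structure the model learned itself.
directions = rng.normal(size=(3, d))
activations = directions[square_state] + 0.5 * rng.normal(size=(n, d))

# A linear probe is a single linear map from frozen activations to the
# target; here it is fit by least squares against one-hot targets.
targets = np.eye(3)[square_state]
W, *_ = np.linalg.lstsq(activations[:500], targets[:500], rcond=None)

# Accuracy on held-out activations; ~0.33 would be chance for 3 classes.
preds = (activations[500:] @ W).argmax(axis=1)
accuracy = (preds == square_state[500:]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

Probe accuracy well above chance on held-out data is the kind of evidence used to argue that board state is linearly represented in the activations; the paper's intervention experiments then edit those representations directly.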