Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Prior experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single rule set. We introduce MetaOthello, a controlled suite of Othello variants that share syntax but differ in rules or tokenization, and we train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a largely shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another variant's internal state nearly as effectively as matched probes. For isomorphic games related by a token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations, a middle layer identifies which game is being played, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
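To make the probe-transfer claim concrete, the sketch below shows one way a cross-variant probe intervention could look. This is a minimal illustration, not the paper's code: the cached activations, the board-square labels, the `intervene` helper, and the additive edit with strength `alpha` are all assumptions, and the paper may use a different probe target or intervention scheme.

```python
# Sketch: fit a linear probe on variant A's activations, then use its class
# direction to edit an activation from variant B (the cross-variant transfer
# test the abstract describes). Data here is random and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_samples = 512, 4096

# Hypothetical cached residual-stream vectors at one layer, with labels giving
# the state of a single board square (0 = empty, 1 = black, 2 = white).
acts_variant_a = rng.normal(size=(n_samples, d_model))
square_labels = rng.integers(0, 3, size=n_samples)

# Linear probe trained on variant A only.
probe = LogisticRegression(max_iter=1000).fit(acts_variant_a, square_labels)

def intervene(act, target_class, alpha=4.0):
    """Push one activation toward the probe's target-class direction.

    Simplest additive edit; a real intervention might instead project out
    the currently decoded class before adding the target direction.
    """
    direction = probe.coef_[target_class]
    direction = direction / np.linalg.norm(direction)
    return act + alpha * direction

# Apply variant A's probe direction to a variant-B activation and check
# whether the probe's readout flips to the target class.
act_variant_b = rng.normal(size=d_model)
edited = intervene(act_variant_b, target_class=2)
print(probe.predict(edited[None])[0])  # ideally the target class, 2
```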
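The "equivalent up to a single orthogonal rotation" claim can be tested with the orthogonal Procrustes solution. The sketch below is one plausible evaluation under the assumption that paired activations for the same positions under the two tokenizations are available; the matrices `X` and `Y` are stand-ins, not the paper's data.

```python
# Sketch: fit the orthogonal R minimizing ||X @ R - Y||_F between paired
# activations from two token-remapped variants, then measure alignment error.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)
n, d = 2048, 512

# Hypothetical paired activations: the same game positions encoded under the
# two tokenizations, read out at the same layer.
X = rng.normal(size=(n, d))                       # variant A activations
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a ground-truth rotation
Y = X @ R_true + 0.01 * rng.normal(size=(n, d))    # variant B = rotated A + noise

# Procrustes fit; the abstract's cross-layer claim would correspond to
# reusing this same R on activations taken from other layers.
R, _ = orthogonal_procrustes(X, Y)

relative_error = np.linalg.norm(X @ R - Y) / np.linalg.norm(Y)
print(f"relative alignment error: {relative_error:.3f}")
```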