Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuitry of path-following by identifying attention heads (dubbed $\textit{adjacency heads}$) that are implicated in finding valid subsequent tokens.
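To make the notion of linear decoding concrete, the following is a minimal sketch of a linear probe, not the procedure used in this work. It assumes that a maze can be described by a binary vector with one entry per potential lattice edge, and it substitutes synthetic vectors for real residual-stream activations. The names `fit_linear_probe`, `decode_edges`, and `make_synthetic_data`, as well as all sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_linear_probe(acts, targets, ridge=1e-3):
    """Fit W, b so that acts @ W + b approximates targets (ridge least squares)."""
    n, d = acts.shape
    X = np.hstack([acts, np.ones((n, 1))])      # append a bias column
    A = X.T @ X + ridge * np.eye(d + 1)         # regularized normal equations
    Wb = np.linalg.solve(A, X.T @ targets)
    return Wb[:-1], Wb[-1]                      # weights, bias


def decode_edges(acts, W, b, threshold=0.5):
    """Threshold the probe output to predict which edges are present."""
    return (acts @ W + b) > threshold


def make_synthetic_data(n_mazes=2000, d_model=64, n_edges=24, seed=0):
    """ASSUMPTION: stand-in data, not activations from a trained transformer.

    Each "maze" is a random binary edge vector (24 potential edges matches a
    4x4 lattice), embedded linearly into d_model dimensions plus small noise.
    """
    rng = np.random.default_rng(seed)
    edges = rng.integers(0, 2, size=(n_mazes, n_edges)).astype(float)
    mixing = rng.normal(size=(n_edges, d_model))
    acts = edges @ mixing + 0.1 * rng.normal(size=(n_mazes, d_model))
    return acts, edges


acts, edges = make_synthetic_data()
train, test = slice(0, 1500), slice(1500, None)

W, b = fit_linear_probe(acts[train], edges[train])
predicted = decode_edges(acts[test], W, b)
accuracy = (predicted == edges[test].astype(bool)).mean()
print(f"held-out edge accuracy: {accuracy:.3f}")
```

With real data, `acts` would hold the residual-stream vector at one chosen token position for each maze, and high held-out accuracy would indicate that the maze's connectivity is linearly recoverable from that single vector. The synthetic data here is linearly decodable by construction, so the sketch demonstrates only the mechanics of the probe.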