Large transformers are powerful architectures for self-supervised analysis of many kinds of data, ranging from protein sequences to text to images. In these models, the data representations in the hidden layers live in the same space, and the semantic structure of the dataset emerges through a sequence of functionally identical transformations between one representation and the next. Here we characterize the geometric and statistical properties of these representations, focusing on how these properties evolve across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition, we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers that are most suitable for downstream learning tasks.
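As an illustration of what an ID profile across layers could look like in practice, the following is a minimal sketch, not the authors' implementation. The abstract does not name the ID estimator; the sketch assumes a maximum-likelihood variant of the two-nearest-neighbor (TwoNN) estimator, which uses only the ratio of each point's second to first nearest-neighbor distance. The function names `twonn_id` and `id_profile`, and the layout of the per-layer inputs (one array of shape `(n_samples, n_features)` per layer), are hypothetical.

```python
import numpy as np


def twonn_id(X):
    """Estimate the intrinsic dimension of the rows of X.

    Assumption: maximum-likelihood form of the TwoNN estimator. The ratio
    mu = r2 / r1 of second to first nearest-neighbor distances is modeled
    as Pareto-distributed with exponent d, so d_hat = N / sum(log(mu)).
    """
    X = np.asarray(X, dtype=np.float64)
    # Squared pairwise distances via the Gram matrix (O(N^2) memory).
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)          # clip small negative rounding errors
    np.fill_diagonal(d2, np.inf)      # exclude each point from its own neighbors
    # The two smallest squared distances per row, in ascending order.
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(two_smallest[:, 0])
    r2 = np.sqrt(two_smallest[:, 1])
    valid = r1 > 0                    # drop exact duplicates to avoid division by zero
    mu = r2[valid] / r1[valid]
    return mu.size / np.sum(np.log(mu))


def id_profile(layer_representations):
    """Return one ID estimate per layer.

    `layer_representations` is assumed to be a sequence of arrays, one per
    layer, each of shape (n_samples, n_features).
    """
    return [twonn_id(h) for h in layer_representations]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for hidden representations: points on a 2D and a
    # 5D flat manifold, each zero-padded into a 20-dimensional ambient space.
    low = np.hstack([rng.uniform(size=(1000, 2)), np.zeros((1000, 18))])
    high = np.hstack([rng.uniform(size=(1000, 5)), np.zeros((1000, 15))])
    print(id_profile([low, high]))
```

In an actual analysis, each array would hold the hidden representation of the same set of inputs at one layer, and the resulting list could be plotted against layer index to look for the peak and subsequent contraction described above. The quadratic memory cost of the distance matrix limits this sketch to a few thousand samples.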