Protein sequences abound in repeated segments, both exact copies and approximate copies that carry mutations. These repeats are important for protein structure and function, and they have motivated decades of algorithmic work on repeat identification. Recent work examining masked-token prediction has shown that protein language models (PLMs) identify repeats. To elucidate the internal mechanisms behind this behavior, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that for exact repeats. We then characterize this mechanism, revealing two main stages. First, PLMs build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Second, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, establishing a basis for studying more complex evolutionary processes in PLMs.