Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.
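To make the cross-layer idea concrete, the following is a minimal NumPy sketch of a cross-layer transcoder forward pass. It is not the ProGenMech implementation. The class name `CrossLayerTranscoder`, the top-k sparsity rule, the random initialization, and all dimensions are assumptions chosen for illustration. It also assumes the common CLT formulation in which the reconstruction of a target layer sums decoder contributions from the latents of every layer up to and including that target.

```python
import numpy as np


def top_k_sparsify(pre_acts, k):
    """Apply ReLU, then keep at most the k largest activations per token."""
    acts = np.maximum(pre_acts, 0.0)
    if k >= acts.shape[-1]:
        return acts
    # Indices of the k largest entries along the latent axis.
    idx = np.argpartition(acts, -k, axis=-1)[..., -k:]
    mask = np.zeros(acts.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, acts, 0.0)


class CrossLayerTranscoder:
    """Hypothetical toy CLT: one encoder per layer, one decoder per (source, target) pair."""

    def __init__(self, n_layers, d_model, n_latents, k, seed=0):
        rng = np.random.default_rng(seed)
        self.n_layers = n_layers
        self.k = k
        # W_enc[l] maps the layer-l input to that layer's latent pre-activations.
        self.W_enc = rng.normal(0.0, d_model ** -0.5, (n_layers, d_model, n_latents))
        # W_dec[src, tgt] maps layer-src latents to the layer-tgt output.
        # Only entries with src <= tgt are used (assumed causal structure).
        self.W_dec = rng.normal(
            0.0, n_latents ** -0.5, (n_layers, n_layers, n_latents, d_model)
        )

    def encode(self, layer_inputs):
        """layer_inputs: (n_layers, n_tokens, d_model) -> list of sparse latents."""
        return [
            top_k_sparsify(layer_inputs[l] @ self.W_enc[l], self.k)
            for l in range(self.n_layers)
        ]

    def decode(self, latents):
        """Reconstruct each layer's output from latents of layers 0..target."""
        recon = []
        for tgt in range(self.n_layers):
            out = sum(latents[src] @ self.W_dec[src, tgt] for src in range(tgt + 1))
            recon.append(out)
        return np.stack(recon)


# Toy usage with made-up sizes: 3 layers, 5 tokens, 8-dim activations.
clt = CrossLayerTranscoder(n_layers=3, d_model=8, n_latents=16, k=4)
layer_inputs = np.random.default_rng(1).normal(size=(3, 5, 8))
latents = clt.encode(layer_inputs)
recon = clt.decode(latents)
```

In a per-layer transcoder, `W_dec[src, tgt]` would be nonzero only for `src == tgt`. The extra off-diagonal decoders are what let a latent found in an early layer account for computation that surfaces in later layers.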