Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into the concepts learned by large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and to steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but that high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
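For readers unfamiliar with the TopK mechanism, the sketch below illustrates how a TopK SAE sparsifies a model activation and how a single latent could be clamped to steer it. It is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, the dimensions (`d_model`, `n_latents`, `k`) are arbitrary, and the helper names `topk_sae_encode` and `steer` are hypothetical. The steering rule shown (clamp one latent, then add the resulting change along its decoder direction) is one common choice; the abstract does not specify which rule the authors use. Ordered SAEs are not covered here.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Keep the k largest pre-activations per row and zero the rest."""
    pre = x @ W_enc + b_enc
    # Indices of the k largest entries in each row (order within the k is arbitrary).
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return np.maximum(z, 0.0)  # ReLU on the kept values

def steer(x, W_enc, b_enc, W_dec, k, feature, value):
    """Clamp one latent to `value` and shift x along that latent's decoder direction.

    Only the change in the reconstruction is added to x, so whatever the SAE
    fails to reconstruct is left untouched (an assumed design choice).
    """
    z = topk_sae_encode(x, W_enc, b_enc, k)
    z_steered = z.copy()
    z_steered[..., feature] = value
    return x + (z_steered - z) @ W_dec

# Toy setup with random, untrained weights (assumption: illustration only).
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))

x = rng.normal(size=(3, d_model))  # three stand-in residual-stream activations
z = topk_sae_encode(x, W_enc, b_enc, k)
feature, value = 7, 5.0            # hypothetical latent index and clamp value
x_steered = steer(x, W_enc, b_enc, W_dec, k, feature, value)
```

In a real setting, `x` would be a hidden activation from the antibody language model and `x_steered` would be written back into the forward pass before generation continues. Whether clamping a latent actually changes the generated sequences is the causal question the abstract distinguishes from mere feature-concept correlation.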