Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.
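The formulate-test-refine cycle described above can be sketched as a simple control loop. This is a minimal illustrative sketch only, not the paper's implementation: `propose`, `probe`, and `refine` are hypothetical callables standing in for the agent's LLM-driven steps (generating candidate explanations, running a targeted activation experiment and scoring the result, and revising an explanation from empirical feedback).

```python
def explain_feature(feature, propose, probe, refine,
                    max_rounds=3, accept_threshold=0.9):
    """Iteratively refine candidate explanations for one SAE feature.

    Hypothetical sketch of an agentic explain-test-refine loop:
    - propose(feature)      -> list of candidate explanations
    - probe(feature, expl)  -> agreement score in [0, 1] between the
                               explanation's predictions and observed
                               feature activations
    - refine(expl, score)   -> revised candidate explanations
    """
    candidates = propose(feature)          # formulate multiple explanations
    best, score = None, -1.0
    for _ in range(max_rounds):
        # Design a targeted experiment for each candidate and score it
        # against empirical activation feedback.
        scored = [(expl, probe(feature, expl)) for expl in candidates]
        best, score = max(scored, key=lambda pair: pair[1])
        if score >= accept_threshold:      # explanation matches activations
            break
        candidates = refine(best, score)   # revise using the feedback
    return best, score
```

In this framing, each round is one "experiment": the agent commits to testable predictions, observes real activations, and either accepts the explanation or revises it, rather than emitting a single-pass guess.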