What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of these SAEs is difficult because we lack a ground-truth collection of interpretable features that we expect good SAEs to recover. We thus propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts. These settings carry natural collections of interpretable features -- for example, "there is a knight on F3" -- which we leverage into $\textit{supervised}$ metrics for SAE quality. To guide progress in interpretable dictionary learning, we introduce a new SAE training technique, $\textit{p-annealing}$, which improves performance on prior unsupervised metrics as well as our new metrics.
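To make the setup concrete, the following is a minimal sketch of a sparse autoencoder of the kind trained in this line of work: activations from an LM are encoded into a wide, nonnegative feature vector and decoded back, with the loss trading off reconstruction error against an L1 sparsity penalty. The dimensions, initialization, and L1 coefficient here are illustrative assumptions, not the paper's training configuration (in particular, p-annealing modifies the sparsity term in ways not specified by this abstract).

```python
# Minimal SAE sketch (illustrative; not the paper's exact setup).
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 64, 256                    # assumed: activation width, dictionary size
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations x into sparse features f, then reconstruct x."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features nonnegative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty encouraging sparse features."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).mean()
    return recon + sparsity

# A batch of (synthetic) LM activations; in practice these come from a
# trained chess or Othello model's residual stream.
x = rng.normal(0, 1, (8, d_model))
print(sae_loss(x))
```

In this framing, a "good" SAE is one whose individual dictionary features `f` align with interpretable board properties such as "there is a knight on F3", which is what the proposed supervised metrics measure.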