Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language, performing LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder's fidelity by assessing its ability to read and control model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size.