Interpretability methods seek to understand language model representations, yet the outputs of most such methods -- circuits, vectors, scalars -- are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.
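To make the setup concrete, below is a minimal, hypothetical sketch of the LatentQA/LIT idea in PyTorch. It is not the paper's implementation: the tiny stand-in models, names (`TinyLM`, `target`, `decoder`), and random token data are all assumptions for illustration. The sketch shows the two uses described above: (1) training a decoder LLM to answer questions given activations spliced in as a prefix, analogous to visual instruction tuning with image features, and (2) reusing the trained decoder as a differentiable loss for controlling the target model.

```python
# Minimal sketch of the LatentQA / LIT idea; not the paper's implementation.
# Hypothetical components: a frozen "target" model producing activations and a
# trainable "decoder" that reads those activations plus a question and is
# trained to emit the answer. At control time, the same decoder supplies a
# differentiable loss whose gradients flow back into the target model.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class TinyLM(nn.Module):
    """Stand-in for an LLM: embeds tokens, returns hidden states and logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.backbone = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, prefix_states=None):
        h = self.embed(tokens)
        if prefix_states is not None:
            # Splice target-model activations in front of the question tokens,
            # analogous to prepending image features in visual instruction tuning.
            h = torch.cat([prefix_states, h], dim=1)
        states, _ = self.backbone(h)
        return states, self.head(states)

target, decoder = TinyLM(), TinyLM()
for p in target.parameters():
    p.requires_grad_(False)  # activations come from a frozen model during reading

opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# --- LIT training step on an (activations, question, answer) triple -----------
prompt = torch.randint(0, VOCAB, (1, 8))    # text fed to the target model
question = torch.randint(0, VOCAB, (1, 6))  # e.g. "What is the system prompt?"
answer = torch.randint(0, VOCAB, (1, 6))    # ground-truth answer tokens

with torch.no_grad():
    acts, _ = target(prompt)                # activations to be "read"

_, logits = decoder(question, prefix_states=acts)
qa_logits = logits[:, -answer.size(1):, :]  # positions predicting the answer
loss = loss_fn(qa_logits.reshape(-1, VOCAB), answer.reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()

# --- Control step: the decoder defines a differentiable loss on activations ---
# Gradients now flow into the target model, nudging its activations toward ones
# the decoder would describe with the desired answer (e.g. "positive sentiment").
for p in target.parameters():
    p.requires_grad_(True)
acts, _ = target(prompt)
_, logits = decoder(question, prefix_states=acts)
desired = torch.randint(0, VOCAB, (1, 6))   # tokens encoding the desired reading
control_loss = loss_fn(logits[:, -desired.size(1):, :].reshape(-1, VOCAB),
                       desired.reshape(-1))
control_loss.backward()                     # gradients reach the target's parameters
```

In practice the stand-in models would be replaced by the actual target LLM and a pretrained decoder LLM, and the control gradients could update the target's weights or a steering intervention rather than the full model; the sketch only illustrates how a QA-style decoder can serve both as a reader of activations and as a source of a differentiable control signal.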