As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underscore the need for faithful interpretation of their decision-making. While traditional probing methods have shown some effectiveness, they work best for narrowly scoped tasks, and more comprehensive explanations remain necessary. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behavior. We evaluate the ability of meta-models to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area.
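The interface between the two models can be pictured as activations from the input-model being projected into the meta-model's embedding space and prepended to the question. The following is a minimal sketch of that data flow under assumed details: the linear projection, the dimensions, and all variable names here are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper): the
# input-model's hidden size, the meta-model's embedding size, and the
# number of captured positions.
D_INPUT, D_META, N_POSITIONS = 64, 32, 4

# Activations captured from the "input-model" at some layer, one vector
# per sequence position (placeholder random values).
activations = rng.standard_normal((N_POSITIONS, D_INPUT))

# An assumed learned linear projection maps input-model activations into
# the meta-model's embedding space, yielding "soft tokens".
W_proj = rng.standard_normal((D_INPUT, D_META)) / np.sqrt(D_INPUT)
soft_tokens = activations @ W_proj  # shape: (N_POSITIONS, D_META)

# Embeddings of the natural-language question posed to the meta-model
# (placeholders standing in for real token embeddings).
question_embeds = rng.standard_normal((6, D_META))

# The meta-model would consume the soft tokens prepended to the question
# and decode a natural-language answer about the input-model's behavior.
meta_model_input = np.concatenate([soft_tokens, question_embeds], axis=0)
print(meta_model_input.shape)  # (10, 32)
```

Training would then fit `W_proj` (and the meta-model) so that the decoded answers match ground-truth descriptions of the input-model's behavior on the selected task types.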