Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty. In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs. Using linear probing and activation patching, we localize five layers that appear especially important for lying. We then find just 46 attention heads within these layers that enable us to causally intervene such that the lying model instead answers honestly. We show that these interventions work robustly across many prompts and dataset splits. Overall, our work contributes a greater understanding of dishonesty in LLMs so that we may hope to prevent it.
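As a rough illustration of the activation patching idea mentioned above, the following sketch uses a toy stand-in rather than LLaMA-2-70b-chat. Everything in it is an assumption made for illustration: the residual stream is a single integer, the layer and head counts are arbitrary, and the names `LIE_HEADS`, `head_output`, and `run` are hypothetical. It caches head outputs from an honest run, patches them one at a time into a lying run to score each head, and then patches all high-scoring heads together to check that the answer flips back.

```python
# Toy illustration of activation patching. This is NOT the paper's code or model.
# Assumption: the "residual stream" is one integer and the sign encodes the answer.

N_LAYERS, N_HEADS = 6, 4                 # arbitrary toy sizes
LIE_HEADS = {(2, 1), (3, 0), (3, 3)}     # hypothetical heads that react to "lie"


def head_output(layer, head, truth, lie):
    """Toy head: pushes against the truth only if it is a lie head and asked to lie."""
    return -truth if lie and (layer, head) in LIE_HEADS else 0


def run(truth, lie, patch=None):
    """Forward pass; `patch` maps (layer, head) to an output that overrides that head."""
    patch = patch or {}
    resid = truth                        # the toy model starts out "knowing" the truth
    cache = {}
    for layer in range(N_LAYERS):
        for head in range(N_HEADS):
            key = (layer, head)
            out = patch.get(key, head_output(layer, head, truth, lie))
            cache[key] = out
            resid += out
    return resid, cache


truth = 1                                # +1 means the statement is true
honest_resid, honest_cache = run(truth, lie=False)
lying_resid, _ = run(truth, lie=True)

# Patch one head at a time from the honest run into the lying run and
# record how far the residual moves back toward the truth.
effects = {}
for key, value in honest_cache.items():
    patched, _ = run(truth, lie=True, patch={key: value})
    effects[key] = (patched - lying_resid) * truth

# Keep the heads with a positive effect, then patch them all at once.
important = sorted(k for k, e in effects.items() if e > 0)
full_patch = {k: honest_cache[k] for k in important}
patched_resid, _ = run(truth, lie=True, patch=full_patch)

print("heads selected by patching:", important)
print("honest / lying / patched residual:", honest_resid, lying_resid, patched_resid)
```

In this toy setting, patching a single head is not enough to change the sign of the residual, while patching all selected heads together restores the honest answer. That is only meant to convey why a set of heads is intervened on jointly; it says nothing about how the real model behaves.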