Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
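The detector pipeline described above (fixed follow-up questions, yes/no answers encoded as features, logistic regression on top) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the questions in `FOLLOW_UPS` are hypothetical placeholders, the training transcripts are synthetic toy data with an invented answer pattern (the abstract does not say how lying shifts the answers), and the classifier is a small hand-written logistic regression so that the example needs only the Python standard library.

```python
import math
import random

# Hypothetical follow-up questions; the abstract does not list the real ones.
FOLLOW_UPS = [
    "Is the sky blue on a clear day? Answer yes or no.",
    "Does it feel bad to say things that are not true? Answer yes or no.",
    "Is the previous statement accurate? Answer yes or no.",
]

def encode_answers(answers):
    """Map each yes/no answer string to a 1.0/0.0 feature."""
    return [1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability that the preceding statement was a lie."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            err = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def make_toy_transcripts(n, seed=0):
    """Synthetic stand-in for real transcripts (label 1 = lie).

    Assumption for illustration only: after a lie, each follow-up is
    answered "yes" with probability 0.2, otherwise with probability 0.8.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        lied = i % 2
        p_yes = 0.2 if lied else 0.8
        answers = ["yes" if rng.random() < p_yes else "no" for _ in FOLLOW_UPS]
        X.append(encode_answers(answers))
        y.append(lied)
    return X, y

X_train, y_train = make_toy_transcripts(200)
weights, bias = fit_logistic_regression(X_train, y_train)

# Score two new sets of follow-up answers.
p_lie_all_no = predict_proba(weights, bias, encode_answers(["no", "no", "no"]))
p_lie_all_yes = predict_proba(weights, bias, encode_answers(["yes", "Yes.", "yes"]))
```

In a real setting, the answers would come from querying the LLM with each follow-up question after the suspected lie, and the training labels from transcripts where the model was known to have been prompted to lie or to answer truthfully. Nothing in the sketch needs the model's activations or the ground truth of the fact in question, which is what makes the approach black-box.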