How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

翻译：大型语言模型（LLM）能够“撒谎”——我们将其定义为：在可证明意义上“知晓”真相的情况下，仍输出虚假陈述的行为。例如，当被要求输出错误信息时，LLM可能会“撒谎”。本文开发了一种简单的谎言检测方法，既无需访问LLM的激活状态（黑盒条件），也无需掌握相关事实的真实性。该检测器通过在被怀疑的谎言后提出一组预定义的无关追问，并将LLM的“是/否”回答输入逻辑回归分类器进行工作。尽管方法简单，这一谎言检测器却表现出高准确性和惊人的泛化能力。当使用单一场景（即提示GPT-3.5就事实性问题撒谎）的样本训练后，该检测器可跨分布泛化至：(1)其他LLM架构，(2)经微调后撒谎的LLM，(3)谄媚性谎言，以及(4)现实场景（如销售）中涌现的谎言。这些结果表明，LLM具有独特的与谎言相关的行为模式，且该模式在不同架构和场景中保持一致，这为通用谎言检测提供了可能。

相关内容

黑盒

关注 1

在科学，计算和工程学中，黑盒是一种设备，系统或对象，可以根据其输入和输出（或传输特性）对其进行查看，而无需对其内部工作有任何了解。它的实现是“不透明的”（黑色）。几乎任何事物都可以被称为黑盒：晶体管，引擎，算法，人脑，机构或政府。为了使用典型的“黑匣子方法”来分析建模为开放系统的事物，仅考虑刺激/响应的行为，以推断（未知）盒子。该黑匣子系统的通常表示形式是在该方框中居中的数据流程图。黑盒的对立面是一个内部组件或逻辑可用于检查的系统，通常将其称为白盒（有时也称为“透明盒”或“玻璃盒”）。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日