Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it is not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.