Large Language Models (LLMs) are able to provide assistance on a wide range of information-seeking tasks. However, model outputs may be misleading, whether unintentionally or in cases of intentional deception. We investigate the ability of LLMs to be deceptive in the context of providing assistance on a reading comprehension task, using LLMs as proxies for human users. We compare outcomes of (1) when the model is prompted to provide truthful assistance, (2) when it is prompted to be subtly misleading, and (3) when it is prompted to argue for an incorrect answer. Our experiments show that GPT-4 can effectively mislead both GPT-3.5-Turbo and GPT-4, with deceptive assistants resulting in up to a 23% drop in accuracy on the task compared to when a truthful assistant is used. We also find that providing the user model with additional context from the passage partially mitigates the influence of the deceptive model. This work highlights the ability of LLMs to produce misleading information and the effects this may have in real-world situations.
翻译:大型语言模型(LLMs)能够在广泛的信息检索任务中提供协助。然而,无论出于无意还是蓄意欺骗,模型输出都可能具有误导性。本研究以LLMs作为人类用户的代理,探究其在阅读理解任务中提供协助时的欺骗能力。我们比较了以下三种情况的结果:(1)模型被提示提供真实协助时;(2)被提示进行微妙误导时;(3)被提示为错误答案辩护时。实验表明,GPT-4能有效误导GPT-3.5-Turbo和GPT-4,与使用真实助手相比,欺骗性助手导致任务准确率下降高达23%。我们还发现,为用户模型提供文本的额外上下文信息能部分抵消欺骗性模型的影响。这项工作揭示了LLMs生成误导信息的能力及其在现实场景中可能造成的影响。