Recently, AI assistants based on large language models (LLMs) have shown surprising performance on many tasks, such as dialogue, solving math problems, writing code, and using tools. Although LLMs possess extensive world knowledge, they still make factual errors on some knowledge-intensive tasks, such as open-domain question answering. These untruthful responses may pose significant risks in practical applications. We believe that an AI assistant's refusal to answer questions it does not know is a crucial method for reducing hallucinations and making the assistant truthful. Therefore, in this paper, we ask the question "Can AI assistants know what they don't know and express this through natural language?" To answer this question, we construct a model-specific "I don't know" (Idk) dataset for an assistant, which contains its known and unknown questions, based on existing open-domain question answering datasets. We then align the assistant with its corresponding Idk dataset and observe whether it can refuse to answer its unknown questions after alignment. Experimental results show that after alignment with Idk datasets, the assistant can refuse to answer most of its unknown questions. For the questions it attempts to answer, accuracy is significantly higher than before alignment.
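The construction of a model-specific Idk dataset described above can be illustrated with a minimal Python sketch. The abstract does not specify how a question is judged "known" or "unknown", so everything here is an assumption made for illustration: the sampling-based criterion (`num_samples`, `threshold`), the substring-match check in `is_correct`, the refusal template `IDK_RESPONSE`, and all class and function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical refusal template; the actual wording is not given in the abstract.
IDK_RESPONSE = "I don't know."


@dataclass
class QAExample:
    question: str
    answers: List[str]  # gold answers from an open-domain QA dataset


@dataclass
class IdkExample:
    question: str
    target: str   # response the assistant should be aligned toward
    known: bool   # True if the assistant is judged to know the answer


def is_correct(response: str, answers: List[str]) -> bool:
    """Assumed correctness check: any gold answer appears in the response."""
    text = response.lower()
    return any(answer.lower() in text for answer in answers)


def build_idk_dataset(
    examples: List[QAExample],
    generate: Callable[[str], str],
    num_samples: int = 10,
    threshold: float = 1.0,
) -> List[IdkExample]:
    """Label each question as known or unknown for one specific assistant.

    Assumption: a question counts as "known" when the fraction of correct
    sampled responses reaches `threshold`; otherwise it is "unknown" and the
    target becomes the refusal template.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")

    dataset = []
    for example in examples:
        responses = [generate(example.question) for _ in range(num_samples)]
        correct = [r for r in responses if is_correct(r, example.answers)]
        accuracy = len(correct) / num_samples
        if correct and accuracy >= threshold:
            # Known question: keep one of the assistant's own correct answers.
            dataset.append(IdkExample(example.question, correct[0], True))
        else:
            # Unknown question: the assistant should refuse to answer.
            dataset.append(IdkExample(example.question, IDK_RESPONSE, False))
    return dataset
```

Here `generate` stands in for a call to the assistant being studied. Because the labels depend on that assistant's own responses, running the same procedure with a different assistant would generally yield a different dataset, which is what "model-specific" means in the text above.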