An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is difficult for humans to verify and to base decisions on accurately. To combat this problem, we propose Highlighted Chain-of-Thought prompting (HoT), a technique that prompts LLMs to generate responses with XML tags grounding facts to those provided in the question. That is, given an input question, the LLM first re-formats the question, adding XML tags that highlight key facts, and then generates a response with highlights over the facts it references from the input. Compared to vanilla chain-of-thought prompting (CoT), HoT reduces the hallucination rate and consistently improves LLM accuracy across 22 tasks spanning arithmetic, reading comprehension, and logical reasoning. When humans are asked to verify LLM responses, the highlights help time-limited participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoT tends to fool users into believing that an answer is correct.
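As a concrete illustration of the tagging scheme described above, the sketch below shows what a HoT-style question/answer pair might look like, along with a simple check that every highlighted fact in the answer is grounded in the question. The tag format (`<fact1>`, `<fact2>`, ...) and the example question are illustrative assumptions, not a definitive specification of the method.

```python
import re

# Hypothetical HoT-style input: the question is re-formatted so that
# key facts are wrapped in numbered XML tags.
question = (
    "A pencil costs <fact1>$0.50</fact1> and an eraser costs "
    "<fact2>$0.25</fact2>. How much do <fact3>three pencils</fact3> "
    "and one eraser cost?"
)

# Hypothetical HoT-style response: facts referenced from the input
# carry the same tags, letting a reader trace each claim to its source.
answer = (
    "Three pencils cost 3 x <fact1>$0.50</fact1> = $1.50, and one eraser "
    "costs <fact2>$0.25</fact2>, so the total is $1.75."
)

def tagged_facts(text: str) -> dict[str, str]:
    """Map each tag name (e.g. 'fact1') to the fact text it highlights."""
    return dict(re.findall(r"<(fact\d+)>(.*?)</\1>", text))

def grounded(question: str, answer: str) -> bool:
    """True if every tagged fact in the answer matches the question verbatim."""
    q_facts, a_facts = tagged_facts(question), tagged_facts(answer)
    return all(q_facts.get(tag) == fact for tag, fact in a_facts.items())
```

A verifier could use `grounded(question, answer)` as a cheap consistency check: if the answer cites a tag whose text differs from the question's, the highlight is not actually anchored to the input.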