An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is challenging for humans to verify and to accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, an LLM first re-formats the question to add XML tags highlighting key facts, and then generates a response that highlights the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on 17 tasks ranging from arithmetic and reading comprehension to logical reasoning. When asked to verify LLM responses, highlights help time-limited human participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoT tends to make users believe that an answer is correct.
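To make the two-step format concrete, the sketch below illustrates a HoT-style exchange and a simple grounding check. The tag names (`<fact1>`, `<fact2>`) and the example question are our own illustration of the idea, not necessarily the exact format used in the paper:

```python
import re

# Illustrative HoT-style exchange (tag names and content are assumptions).
# Step 1: the LLM re-formats the question, wrapping key facts in XML tags.
reformatted_question = (
    "Question: <fact1>Alice has 3 apples</fact1> and she "
    "<fact2>buys 2 more</fact2>. How many apples does she have?"
)
# Step 2: the LLM answers, re-using the same tags around referenced facts.
answer = (
    "Since <fact1>Alice has 3 apples</fact1> and she "
    "<fact2>buys 2 more</fact2>, she has 3 + 2 = 5 apples. Answer: 5."
)

def extract_facts(text):
    """Return {tag: highlighted span} for every <factN>...</factN> pair."""
    return dict(re.findall(r"<(fact\d+)>(.*?)</\1>", text))

# Grounding check: every fact highlighted in the answer should appear
# verbatim under the same tag in the re-formatted question.
q_facts = extract_facts(reformatted_question)
a_facts = extract_facts(answer)
grounded = all(q_facts.get(tag) == span for tag, span in a_facts.items())
print(grounded)  # True: each answer highlight matches a question fact
```

Because the highlights in the answer point back to spans in the question, a reader (or a script like the one above) can check each referenced fact against the input rather than the model's free-form text.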