Hallucination plagues even frontier LLMs, but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new state of the art (SotA) on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers, and find an average of 0.62 hallucinations per summary for ChatGPT (16k), 0.84 for GPT-4, and 1.55 for Claude 2. When we ask the models to self-correct using Factored Critiques, the average number of hallucinations falls to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.