Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," producing non-factual outputs. Our carefully designed human evaluation substantiates the seriousness of this hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers for measuring and incentivizing progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers whose outputs correlate strongly with human judgments. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT-3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.