In the age of misinformation, hallucination - the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses - represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final estimation of hallucination rates, we build an open-domain QA dataset for 30 languages, with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seem uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more and, significantly, that LLMs with broader language support display higher hallucination rates.
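The distinction between absolute hallucinated tokens and the length-normalized hallucination rate can be made concrete with a minimal sketch. This is illustrative only: the abstract does not specify the exact metric definition, so the token-level 0/1 labels and the function name below are assumptions.

```python
def hallucination_stats(token_labels):
    """Hypothetical helper: token_labels is a list of 0/1 flags,
    one per generated token (1 = token marked as hallucinated)."""
    absolute = sum(token_labels)          # hallucinated tokens, absolute count
    rate = absolute / len(token_labels)   # rate normalized for response length
    return absolute, rate

# A longer response can contain more hallucinated tokens in absolute
# terms while having the same length-normalized hallucination rate:
long_resp = [1, 0, 0, 0] * 5   # 20 tokens, 5 hallucinated
short_resp = [1, 0, 0, 0]      # 4 tokens, 1 hallucinated

print(hallucination_stats(long_resp))   # (5, 0.25)
print(hallucination_stats(short_resp))  # (1, 0.25)
```

This mirrors the abstract's observation: high-resource languages yield longer responses and thus more hallucinated tokens in absolute terms, even when the normalized rate is comparable.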