Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that hallucinations are consistently easier to detect in question answering than in dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation\footnote{https://huggingface.co/datasets/sabdalja/HalluVerse-M3}.