Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models perform strongly on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between the original content and its hallucinated counterpart. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that hallucinations are consistently easier to detect in question answering than in dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
