Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models

Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, and so lack the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introduce Poly-FEVER, a large-scale multilingual fact verification benchmark specifically designed for evaluating hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. It provides the first large-scale dataset tailored for analyzing hallucination patterns across languages, enabling systematic evaluation of LLMs such as ChatGPT and the LLaMA series. Our analysis reveals how topic distribution and web resource availability influence hallucination frequency, uncovering language-specific biases that impact model accuracy. By offering a multilingual benchmark for fact verification, Poly-FEVER facilitates cross-linguistic comparisons of hallucination detection and contributes to the development of more reliable, language-inclusive AI systems. The dataset is publicly available at https://huggingface.co/datasets/HanzhiZhang/Poly-FEVER to advance research in responsible AI, fact-checking methodologies, and multilingual NLP, promoting greater transparency and robustness in LLM performance.

