Large language models (LLMs) have become central to contemporary natural language processing and are increasingly applied across a wide range of industries. However, as large-scale probabilistic models, they cannot yet guarantee the quality required for professional content generation: they often produce hallucinated text, which undermines their practical utility in professional settings. To assess the real-world reliability of LLMs in text generation, numerous efforts have constructed benchmarks for evaluating hallucination. Owing to cost and time constraints, however, these benchmarks typically rely on constrained generation techniques, such as directed hallucination induction or deliberately altering authentic text to introduce hallucinations. Such techniques are ill-matched to the unrestricted text generation demanded by real-world applications. Moreover, a well-established Chinese-language dataset dedicated to evaluating hallucination in text generation is still lacking. We therefore develop UHGEval, an Unconstrained Hallucination Generation Evaluation benchmark designed to collect outputs produced by LLMs under minimal restrictions. We also establish a comprehensive evaluation framework that enables subsequent researchers to run scalable and reproducible experiments. Finally, we conduct extensive experiments on prominent Chinese language models and the GPT series, deriving professional insights into their hallucination behavior.
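To make the distinction concrete, the sketch below contrasts the two evaluation settings described above. It is a minimal illustration, not the UHGEval implementation: the `generate` stub stands in for any LLM call, and the keyword-level hallucination check is a hypothetical simplification of real detection methods.

```python
# Illustrative sketch (hypothetical, not the UHGEval API): constrained
# hallucination induction perturbs authentic text directly, while
# unconstrained evaluation lets the model continue freely and then
# checks the open-ended output for unsupported facts.

def generate(prompt: str) -> str:
    # Stub LLM: returns a fixed continuation for illustration only.
    return "The company reported revenue of 5 billion yuan in 2020."

def constrained_sample(reference: str) -> str:
    """Constrained benchmarks manufacture a hallucination by editing
    authentic text, e.g. swapping one fact for another."""
    return reference.replace("4 billion", "9 billion")

def unconstrained_sample(news_beginning: str) -> str:
    """Unconstrained evaluation asks the model to continue freely;
    any hallucination must then be detected in its own output."""
    return generate(f"Continue this news article: {news_beginning}")

def contains_unsupported_fact(continuation: str, reference: str) -> bool:
    # Naive keyword check: flag any number in the continuation that
    # does not appear in the reference text.
    ref_tokens = {t.strip(".") for t in reference.split()}
    return any(tok.strip(".") not in ref_tokens
               for tok in continuation.split() if tok.strip(".").isdigit())

reference = "The company reported revenue of 4 billion yuan in 2020."
continuation = unconstrained_sample("The company reported")
print(contains_unsupported_fact(continuation, reference))  # → True ("5" vs "4")
```

The key design point is that in the unconstrained setting no hallucination is planted in advance; the detector must handle whatever the model freely produces, which is why unconstrained benchmarks are costlier to build but closer to real-world use.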