Large language models (LLMs) have become central to contemporary natural language processing and are increasingly applied across a wide range of industries. However, these large-scale probabilistic models cannot yet guarantee the quality required for professional content generation. They often produce hallucinated text, which compromises their practical utility in professional settings. To assess how reliable LLMs actually are at text generation, numerous efforts have developed benchmarks for evaluating hallucination. Because of cost and time constraints, however, these benchmarks frequently rely on constrained generation techniques, such as directed hallucination induction and the deliberate alteration of authentic text to produce hallucinations. Such approaches do not match the unrestricted text generation that real-world applications demand. Furthermore, there is currently no well-established Chinese-language dataset dedicated to evaluating hallucinations in text generation. We therefore developed the Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, which compiles outputs that LLMs produce under minimal restrictions. We also built a comprehensive evaluation framework to help subsequent researchers run scalable and reproducible experiments, and we conducted extensive experiments evaluating prominent Chinese language models and the GPT series models to gain insight into how they perform with respect to hallucination.