In this paper, we establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models. HalluQA contains 450 meticulously designed adversarial questions that span multiple domains and take into account Chinese historical culture, customs, and social phenomena. During the construction of HalluQA, we consider two types of hallucinations, imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT. For evaluation, we design an automated method that uses GPT-4 to judge whether a model output is hallucinated. We conduct extensive experiments on 24 large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk, and others. Of the 24 models, 18 achieve non-hallucination rates lower than 50%, which indicates that HalluQA is highly challenging. We analyze the primary types of hallucinations in different types of models and their causes. Additionally, we discuss which types of hallucinations should be prioritized for different types of models.
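The non-hallucination rate reported above can be illustrated with a minimal sketch. The abstract does not give the formula, so this assumes the rate is simply the fraction of answers that the judge marks as not hallucinated. The function name, the boolean verdict encoding, and the example counts are hypothetical and are not taken from the HalluQA release.

```python
from typing import Sequence


def non_hallucination_rate(verdicts: Sequence[bool]) -> float:
    """Return the fraction of answers judged free of hallucination.

    verdicts[i] is True when the judge marked answer i as hallucinated.
    Assumption: the rate is (non-hallucinated answers) / (all answers).
    """
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    clean = sum(1 for hallucinated in verdicts if not hallucinated)
    return clean / len(verdicts)


# Hypothetical judge verdicts for one model over 450 questions:
# 270 answers flagged as hallucinated, 180 judged clean.
example = [True] * 270 + [False] * 180
rate = non_hallucination_rate(example)
print(f"{rate:.1%}")  # 40.0%
```

Under this assumed definition, a model with the made-up verdicts above would fall below the 50% threshold mentioned in the abstract.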