Hallucinations in Large Language Models (LLMs) pose a significant challenge: models generate misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (\(82.2\%\)) and TPR (\(78.9\%\)). However, HHEM struggles with localized hallucinations in summarization tasks. To mitigate this, we introduce segment-based retrieval, which improves detection by verifying smaller text components independently. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
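The segment-based retrieval idea above can be sketched minimally: split the summary into sentence-level segments and verify each one against the source independently, so a single fabricated sentence cannot hide inside an otherwise faithful summary. The sketch below is illustrative only — it substitutes a simple token-overlap scorer for the actual HHEM classifier, and the function names and threshold are assumptions, not the paper's implementation:

```python
import re

def split_segments(summary: str) -> list[str]:
    # Split the summary into sentence-level segments on end punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]

def consistency_score(source: str, segment: str) -> float:
    # Placeholder for an HHEM-style consistency score: the fraction of
    # segment tokens also present in the source (illustrative stand-in,
    # NOT the real classifier).
    src_tokens = set(source.lower().split())
    seg_tokens = segment.lower().split()
    return sum(t in src_tokens for t in seg_tokens) / max(len(seg_tokens), 1)

def flag_hallucinated_segments(source: str, summary: str,
                               threshold: float = 0.5) -> list[tuple[str, float]]:
    # Score each segment separately; segments below the threshold are
    # flagged as candidate localized hallucinations.
    return [(seg, consistency_score(source, seg))
            for seg in split_segments(summary)
            if consistency_score(source, seg) < threshold]
```

In a real pipeline, `consistency_score` would be replaced by a call to the HHEM classifier on the (source, segment) pair; the segmentation-then-verify loop is the part that targets localized hallucinations.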