Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs), limiting their widespread acceptance beyond chatbot applications. Despite ongoing efforts, they remain prevalent. Detecting hallucinations is itself a formidable task, frequently requiring manual labeling or constrained evaluations. This paper introduces an automated, scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection. We leverage LLMs to generate challenging tasks related to hypothetical phenomena, and subsequently employ LLMs as agents for efficient hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models' performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. The proposed framework provides opportunities to test and improve LLMs. Additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.