This paper presents a comparative analysis of hallucination detection systems for Large Language Models (LLMs), focusing on automatic summarization and question answering tasks. We evaluate the detection systems using the diagnostic odds ratio (DOR) and cost-effectiveness metrics. Our results indicate that although advanced models can perform better, they do so at a much higher cost. We also show that an ideal hallucination detection system must maintain its performance across different model sizes. Our findings highlight the importance of choosing a detection system that matches the needs and resource constraints of the specific application. Future research will explore hybrid systems and the automated identification of underperforming components, with the aim of making the detection and mitigation of hallucinations more reliable and efficient.
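For readers unfamiliar with the diagnostic odds ratio, the following is a minimal sketch of its standard textbook definition, DOR = (TP x TN) / (FP x FN), applied to a detector that flags outputs as hallucinated (the positive class). The function name, the optional continuity correction, and the example counts are illustrative assumptions; the abstract does not specify how the paper computes DOR or handles zero cells.

```python
def diagnostic_odds_ratio(tp, fp, fn, tn, correction=0.0):
    """Standard DOR from confusion-matrix counts (illustrative helper).

    tp: hallucinated outputs correctly flagged
    fp: faithful outputs wrongly flagged
    fn: hallucinated outputs missed
    tn: faithful outputs correctly passed
    correction: optional constant added to every cell (0.5 is a common
        choice) so that DOR stays defined when fp or fn is zero. This is
        an assumption here, not something stated in the abstract.
    """
    tp, fp, fn, tn = (count + correction for count in (tp, fp, fn, tn))
    if fp == 0 or fn == 0:
        raise ValueError("DOR is undefined when fp or fn is zero; "
                         "pass a positive correction.")
    return (tp * tn) / (fp * fn)


# Made-up counts, for illustration only:
# 80 true positives, 10 false positives, 20 false negatives, 90 true negatives.
example_dor = diagnostic_odds_ratio(tp=80, fp=10, fn=20, tn=90)
print(example_dor)  # 36.0
```

A DOR of 1 means the detector does no better than chance, and larger values mean better discrimination. Because DOR is a single number that does not depend on how common hallucinations are in the test set, it is convenient for comparing detectors before weighing them against cost.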