Large language models (LLMs) often generate statements that are unsupported or unverifiable, a phenomenon known as "hallucination." To address this, retrieval-augmented LLMs include citations in their outputs, grounding the generated content in verifiable sources. Despite these developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies tackle this challenge by leveraging faithfulness metrics to estimate citation support automatically. However, they limit citation support estimation to a binary classification scenario, neglecting the fine-grained support levels that arise in practice. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses how well each metric distinguishes citations across three support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to comprehensively measure the alignment between metric scores and human judgments. Our results indicate that no single metric consistently excels across all evaluations, highlighting the complexity of accurately evaluating fine-grained support levels. In particular, we find that even the best-performing metrics struggle to distinguish partial support from full or no support. Based on these findings, we provide practical recommendations for developing more effective metrics.