Large language models (LLMs) often produce unsupported or unverifiable information, known as "hallucinations." To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses the metric effectiveness in distinguishinging citations between three-category support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.
翻译:大型语言模型(LLM)常产生缺乏依据或无法验证的信息,即“幻觉”。为缓解此问题,检索增强型LLM引入了引用机制,将生成内容锚定于可验证的来源。尽管有此进展,人工评估引用对相关陈述的支持程度仍面临重大挑战。先前研究采用忠实度指标自动估计引用支持,但仅限于二元分类,忽略了实际场景中的细粒度引用支持。为探究忠实度指标在细粒度场景中的有效性,我们提出一个对比评估框架,该框架评估指标在区分三类支持级别(完全支持、部分支持、无支持)引用时的效能。我们的框架综合运用相关性分析、分类评估与检索评估,全面衡量指标得分与人工判断的一致性。结果表明,没有单一指标能在所有评估中持续表现优异,这揭示了评估细粒度支持的复杂性。基于研究发现,我们为开发更有效的指标提供了实用建议。