Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs), yet current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. In this work, we rethink the requirements for evaluating reliable LLM-as-KB usage and highlight two essential factors: factuality, ensuring accurate responses to seen and unseen knowledge, and consistency, maintaining stable answers to questions about the same knowledge. We introduce UnseenQA, a dataset designed to assess LLM performance on unseen knowledge, and propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score. Our experiments on 26 LLMs reveal several challenges regarding their use as KBs, underscoring the need for more principled and comprehensive evaluation.