Large language models (LLMs) outperform information retrieval techniques on downstream knowledge-intensive tasks when prompted to generate world knowledge. However, the community has raised concerns about the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives: Factuality, Relevance, Coherence, Informativeness, Helpfulness and Validity. We conduct an extensive empirical analysis of the knowledge generated by three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that lower factuality of generated knowledge does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs matter more than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.