Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned during pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and varying levels of knowledge popularity. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and that model scaling helps, as larger models outperform smaller ones within every model family. Even so, the best performance, achieved by GPT-4, still falls well short of the upper bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which significantly degrade factual knowledge recall in large models. By further decoupling a model's known knowledge from its unknown knowledge, we find the degradation is attributable to exemplars that contradict the model's known knowledge, and that it worsens as the number of such exemplars grows. Lastly, we fine-tune LLaMA-7B under different configurations of known and unknown knowledge. We find that fine-tuning on a model's known knowledge is beneficial and consistently outperforms fine-tuning on unknown or mixed knowledge. We will make our benchmark publicly available.