Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Factuality evaluators, however, must themselves be evaluated in order to gauge progress and foster advancements. This direction remains under-explored, which substantially impedes progress on factuality evaluators. To address this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. In contrast to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from satisfactory at faithfully detecting factual errors.