While large language models (LLMs) have proven effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark compares the scores an LLM assigns to a factually consistent summary and to a factually inconsistent summary of the same input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. For factually inconsistent summaries, we take outputs from a suite of summarization models and keep those we manually annotated as factually inconsistent. A model's factual consistency is then measured by its accuracy, i.e., the proportion of documents for which it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families, including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent ones. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than to the factually consistent summaries. We validate design choices in our benchmark, including the scoring method and the source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.
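The accuracy metric described above can be illustrated with a minimal sketch. It assumes the per-summary scores have already been computed by some scoring method and are supplied as plain floats; the function name `fib_accuracy`, the pair layout, the example numbers, and the choice to count ties as incorrect are all hypothetical and are not taken from the released code.

```python
from typing import Sequence, Tuple


def fib_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the fraction of documents whose factually consistent summary
    receives a strictly higher score than its inconsistent counterpart.

    Each pair is (consistent_score, inconsistent_score) for one document.
    Counting ties as incorrect is an assumption made for this sketch.
    """
    if not score_pairs:
        raise ValueError("need at least one scored document")
    wins = sum(
        1 for consistent, inconsistent in score_pairs if consistent > inconsistent
    )
    return wins / len(score_pairs)


# Hypothetical scores for four documents (higher means preferred by the model).
example_pairs = [(-1.2, -1.9), (-0.8, -0.5), (-2.1, -2.4), (-1.0, -1.0)]
accuracy = fib_accuracy(example_pairs)
print(accuracy)  # 2 of 4 documents favor the consistent summary -> 0.5
```

Under this reading, a model that scores the two summaries at random would land near 0.5, so values well above that indicate a preference for factually consistent summaries.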