Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing factual generation evaluation methods focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent rare and unlikely facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark that evaluates an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create two benchmarks: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; and (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available at https://github.com/AI21Labs/factor.
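To make the comparison of true facts vs. similar but incorrect statements concrete, the following is a minimal sketch of how such a benchmark could be scored. It rests on assumptions not stated above: each example is taken to consist of a prefix, one true completion, and several incorrect variants, and the LM is counted as correct only when it scores the true completion strictly higher than every variant. The names `factor_style_accuracy`, `toy_examples`, and `toy_score` are hypothetical, and the toy scores are invented stand-ins for an LM's log-likelihoods, not results.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def factor_style_accuracy(
    examples: Sequence[dict],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the true completion outscores every false one.

    `score(prefix, completion)` is a hypothetical stand-in for an LM's
    (possibly length-normalized) log-likelihood of `completion` given `prefix`.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = 0
    for ex in examples:
        true_score = score(ex["prefix"], ex["true"])
        # Assumption: the LM is "correct" only if the true completion
        # receives a strictly higher score than each incorrect variant.
        if all(true_score > score(ex["prefix"], alt) for alt in ex["false"]):
            correct += 1
    return correct / len(examples)


# Toy data for illustration only; the scores below are made up.
toy_examples: List[dict] = [
    {
        "prefix": "The Eiffel Tower is located in",
        "true": " Paris.",
        "false": [" Rome.", " Berlin."],
    },
    {
        "prefix": "Water is composed of hydrogen and",
        "true": " oxygen.",
        "false": [" nitrogen."],
    },
]

_toy_scores: Dict[Tuple[str, str], float] = {
    ("The Eiffel Tower is located in", " Paris."): -1.0,
    ("The Eiffel Tower is located in", " Rome."): -2.5,
    ("The Eiffel Tower is located in", " Berlin."): -3.0,
    ("Water is composed of hydrogen and", " oxygen."): -2.0,
    ("Water is composed of hydrogen and", " nitrogen."): -1.5,
}


def toy_score(prefix: str, completion: str) -> float:
    """Look up a made-up score; a real setup would query an LM instead."""
    return _toy_scores[(prefix, completion)]


if __name__ == "__main__":
    print(factor_style_accuracy(toy_examples, toy_score))
```

In this toy setup the first example is scored as correct and the second is not, so the accuracy is 0.5. Replacing `toy_score` with a function that queries a real LM would be the main change needed to apply the same loop to actual benchmark data.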