Professionals in academia, law, and finance audit their documents because inconsistencies can carry monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce FIND (Finding INconsistencies in Documents), a benchmark in which each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found previously undiscovered inconsistencies in the original documents: on 50 arXiv papers, we judged 136 of the model's 196 suggestions to be legitimate inconsistencies missed by the original authors. Still, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection remains a challenging task.