FOLIO: Natural Language Reasoning with First-Order Logic

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, Dragomir Radev

Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate for measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to deductively reason about the validity of each conclusion. The logical correctness of the premises and conclusions is ensured by their FOL annotations, which are automatically verified by an FOL inference engine. In addition to the main NL reasoning task, the NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of medium-sized language models under supervised fine-tuning. For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models. Our results show that a subset of FOLIO presents a challenge for GPT-4, one of the most capable publicly available LLMs.
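To make the reasoning task concrete, the following is a minimal sketch of how a conclusion can be labeled against a set of premises. It is not the paper's inference engine, which the abstract does not name: it swaps full first-order inference for brute-force enumeration over propositional truth assignments, and the three-way labels (True, False, Unknown), the function name label_conclusion, and the toy premises are all assumptions made for illustration.

```python
from itertools import product


def label_conclusion(atoms, premises, conclusion):
    """Label a conclusion as 'True', 'False', or 'Unknown' given premises.

    atoms: list of proposition names (hypothetical, for illustration).
    premises: list of functions mapping an assignment dict to bool.
    conclusion: function mapping an assignment dict to bool.
    """
    # Enumerate every truth assignment and keep those satisfying all premises.
    models = []
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(premise(world) for premise in premises):
            models.append(world)
    if not models:
        raise ValueError("premises are inconsistent")
    # Entailed: the conclusion holds in every model of the premises.
    if all(conclusion(m) for m in models):
        return "True"
    # Contradicted: the conclusion fails in every model of the premises.
    if all(not conclusion(m) for m in models):
        return "False"
    # Otherwise the premises leave the conclusion undetermined.
    return "Unknown"


# Toy premises about one individual (assumed example, not taken from FOLIO):
# 1. If it is a dog, then it is a mammal.  2. It is a dog.
ATOMS = ["dog", "mammal", "flies"]
PREMISES = [
    lambda w: (not w["dog"]) or w["mammal"],
    lambda w: w["dog"],
]

print(label_conclusion(ATOMS, PREMISES, lambda w: w["mammal"]))      # True
print(label_conclusion(ATOMS, PREMISES, lambda w: not w["mammal"]))  # False
print(label_conclusion(ATOMS, PREMISES, lambda w: w["flies"]))       # Unknown
```

Enumeration is exponential in the number of atoms and cannot handle quantifiers, so it only illustrates the labeling logic; verifying real FOL annotations would need an actual first-order prover.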

