Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate for measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to reason deductively about the validity of the conclusion. The logical correctness of the premises and conclusions is ensured by their FOL annotations, which are automatically verified by an FOL inference engine. In addition to the main NL reasoning task, the NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of medium-sized language models under supervised fine-tuning. For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models. Our results show that a subset of FOLIO presents a challenge for GPT-4, one of the most capable publicly available LLMs.
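To make the idea of checking a conclusion against a set of premises concrete, the following is a minimal sketch, not the inference engine used for FOLIO. It assumes a toy example that is not taken from the dataset ("All cats are mammals; Tom is a cat"), grounds the quantified premise at a single constant so that it reduces to propositional logic, and decides entailment by brute-force enumeration of truth assignments. The function names `entails` and `label`, the atom names, and the three-way True/False/Unknown labeling are assumptions made for illustration only.

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """Return True if every truth assignment that satisfies all
    premises also satisfies the conclusion (brute-force check)."""
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # found a counterexample world
    return True

def label(premises, conclusion, atoms):
    """Hypothetical three-way label: 'True' if the premises entail the
    conclusion, 'False' if they entail its negation, else 'Unknown'."""
    if entails(premises, conclusion, atoms):
        return "True"
    if entails(premises, lambda w: not conclusion(w), atoms):
        return "False"
    return "Unknown"

# Toy premises (illustrative, not from FOLIO), grounded at the constant `tom`.
atoms = ["Cat(tom)", "Mammal(tom)", "Pet(tom)"]
premises = [
    # "All cats are mammals": Cat(x) -> Mammal(x), instantiated with x = tom
    lambda w: (not w["Cat(tom)"]) or w["Mammal(tom)"],
    # "Tom is a cat"
    lambda w: w["Cat(tom)"],
]

print(label(premises, lambda w: w["Mammal(tom)"], atoms))      # True
print(label(premises, lambda w: not w["Mammal(tom)"], atoms))  # False
print(label(premises, lambda w: w["Pet(tom)"], atoms))         # Unknown
```

Enumeration is exponential in the number of atoms and only works after grounding, so it is no substitute for a real FOL prover that handles quantifiers directly. It is used here only because it keeps the entailment check short and self-contained.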