To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why each answer option should be selected or eliminated, resulting in 3,003 multiple-choice subquestions associated with 943 main questions. Experiments on our dataset show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even when they answer the main questions correctly. The models perform particularly poorly on subquestions written for the incorrect options of the main questions, implying that they have a limited capability for explaining why incorrect alternatives should be eliminated. These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models, with a focus on the elimination process of relevant alternatives.
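The comparison described above (subquestion accuracy, restricted to main questions the model answered correctly and split by whether the subquestion targets a correct or an incorrect option) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `SubQuestion` record, its field names, the `subquestion_accuracy` helper, and the toy records are all hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SubQuestion:
    """Hypothetical record for one model prediction on a subquestion."""
    main_id: str                  # id of the associated main question
    targets_correct_option: bool  # True if the rationale concerns the main question's correct option
    model_correct: bool           # True if the model answered this subquestion correctly


def subquestion_accuracy(subs, main_correct):
    """Accuracy on subquestions, grouped by option type.

    Only subquestions whose main question the model answered correctly
    (according to `main_correct`, a dict of main_id -> bool) are counted.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for s in subs:
        if not main_correct.get(s.main_id, False):
            continue  # skip subquestions of main questions the model got wrong
        key = "correct_option" if s.targets_correct_option else "incorrect_option"
        totals[key] += 1
        hits[key] += int(s.model_correct)
    return {k: hits[k] / totals[k] for k in totals}


# Toy, made-up predictions purely for illustration.
main_correct = {"q1": True, "q2": False}
subs = [
    SubQuestion("q1", True, True),
    SubQuestion("q1", False, False),
    SubQuestion("q1", False, True),
    SubQuestion("q2", False, True),  # ignored: main question q2 was answered incorrectly
]
result = subquestion_accuracy(subs, main_correct)
print(result)
```

Reporting the two groups separately is what would expose a gap between selecting the right answer and explaining why the alternatives should be eliminated.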