We propose SETI (Systematicity Evaluation of Textual Inference), a novel and comprehensive benchmark for evaluating the systematicity capabilities of pre-trained language models (PLMs) in textual inference. Specifically, SETI offers three different NLI tasks with corresponding datasets, each evaluating a different type of systematicity in the reasoning process. To solve these tasks, models must perform compositional inference based on known primitive constituents. We conduct experiments with SETI on six widely used PLMs. Results show that various PLMs solve unseen compositional inferences with good performance when they have already encountered the knowledge of how to combine primitives. However, they are considerably limited when this knowledge is unknown to the model (a decrease of 40-100 percentage points). Furthermore, we find that PLMs can improve drastically once exposed to the crucial compositional knowledge in only a few shots. These findings position SETI as the first benchmark for measuring the future progress of PLMs toward systematic generalization in textual inference.