While many natural language inference (NLI) datasets target specific semantic phenomena, e.g., negation, tense and aspect, monotonicity, and presupposition, to the best of our knowledge there is no NLI dataset that covers diverse types of spatial expressions and reasoning. We fill this gap by semi-automatically creating an NLI dataset for spatial reasoning, called SpaceNLI. The data samples are automatically generated from a curated set of reasoning patterns, which experts have annotated with inference labels. We test several SOTA NLI systems on SpaceNLI to gauge the complexity of the dataset and the systems' capacity for spatial reasoning. Moreover, we introduce Pattern Accuracy and argue that it is a more reliable and stricter measure than standard accuracy for evaluating a system's performance on samples generated from patterns. Based on the evaluation results, we find that the systems obtain moderate results on the spatial NLI problems but lack consistency per inference pattern. The results also reveal that non-projective spatial inferences (especially those due to the preposition "between") are the most challenging ones.
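To make the contrast between sample-level accuracy and a pattern-level measure concrete, the following is a minimal sketch, not the exact definition used for SpaceNLI. It assumes that a pattern counts as solved when at least a `threshold` fraction of its generated samples is classified correctly, and that the score is the share of solved patterns. The function names, the `threshold` parameter, and the toy data are all illustrative assumptions.

```python
from collections import defaultdict


def sample_accuracy(samples):
    """Fraction of individual samples whose prediction matches the gold label."""
    if not samples:
        raise ValueError("no samples given")
    return sum(gold == pred for _, gold, pred in samples) / len(samples)


def pattern_accuracy(samples, threshold=1.0):
    """Fraction of patterns that are 'solved'.

    Assumption (not taken from the abstract): a pattern is solved when the
    share of its correctly classified samples is at least `threshold`.
    `samples` is a list of (pattern_id, gold_label, predicted_label) tuples.
    """
    if not samples:
        raise ValueError("no samples given")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pattern_id, gold, pred in samples:
        total[pattern_id] += 1
        correct[pattern_id] += int(gold == pred)
    solved = sum(1 for p in total if correct[p] / total[p] >= threshold)
    return solved / len(total)


# Hypothetical toy data: two patterns with four generated samples each.
toy = (
    [("p1", "entailment", "entailment")] * 4      # p1: 4 of 4 correct
    + [("p2", "contradiction", "contradiction")] * 3
    + [("p2", "contradiction", "neutral")]        # p2: 3 of 4 correct
)

print(sample_accuracy(toy))          # 0.875
print(pattern_accuracy(toy, 1.0))    # 0.5
print(pattern_accuracy(toy, 0.75))   # 1.0
```

In this toy case a system that is right on 7 of 8 samples still solves only one of the two patterns at the strictest threshold, which illustrates why a pattern-level score can be stricter than plain accuracy and can expose inconsistency within a pattern.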