Large language models have been shown to behave inconsistently when given meaning-preserving paraphrases of the same input. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test-set evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric for evaluating the paraphrastic consistency of natural language reasoning models, based on the probability that a model achieves the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model's variance in correctness that is attributable to paraphrasing. To estimate paraphrastic consistency, we collect ParaNLU, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference. Using ParaNLU, we measure the paraphrastic consistency of several model classes and show that consistency increases dramatically with pretraining but not with finetuning. All models tested exhibit room for improvement in paraphrastic consistency.
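To make the metric concrete, the following is a minimal sketch, not the paper's exact estimator. It assumes that two paraphrases are drawn independently and with replacement from the paraphrases of one problem, that problems are weighted equally, and that the link to variance follows from the law of total variance. The function names and the toy data are hypothetical.

```python
from typing import Dict, List


def per_problem_accuracy(correct_by_problem: Dict[str, List[bool]]) -> List[float]:
    """Fraction of paraphrases answered correctly, one value per original problem."""
    thetas = []
    for problem_id, outcomes in correct_by_problem.items():
        if not outcomes:
            raise ValueError(f"problem {problem_id!r} has no paraphrases")
        thetas.append(sum(outcomes) / len(outcomes))
    return thetas


def paraphrastic_consistency(correct_by_problem: Dict[str, List[bool]]) -> float:
    """Probability that two paraphrases of the same problem get the same correctness.

    Assumption: paraphrases are sampled with replacement, so for a problem with
    accuracy theta the agreement probability is theta^2 + (1 - theta)^2.
    """
    thetas = per_problem_accuracy(correct_by_problem)
    return sum(t * t + (1 - t) * (1 - t) for t in thetas) / len(thetas)


def variance_share_from_paraphrasing(correct_by_problem: Dict[str, List[bool]]) -> float:
    """Share of the variance in correctness that lies within problems.

    Law of total variance for a Bernoulli outcome with overall accuracy A:
    A(1 - A) = Var(theta) + E[theta(1 - theta)]; the second term is the
    within-problem part, i.e. the part attributable to paraphrasing.
    """
    thetas = per_problem_accuracy(correct_by_problem)
    accuracy = sum(thetas) / len(thetas)
    total_variance = accuracy * (1 - accuracy)
    if total_variance == 0:
        return 0.0  # the model is always right or always wrong; nothing to attribute
    within = sum(t * (1 - t) for t in thetas) / len(thetas)
    return within / total_variance


# Hypothetical toy results: three problems, four paraphrases each.
toy_results = {
    "p1": [True, True, True, True],      # consistently correct
    "p2": [True, False, True, False],    # flips with the paraphrase
    "p3": [False, False, False, False],  # consistently wrong
}

consistency = paraphrastic_consistency(toy_results)
share = variance_share_from_paraphrasing(toy_results)
print(round(consistency, 3), round(share, 3))  # 0.833 0.333
```

Under these assumptions the two quantities are tied together by share = (1 - consistency) / (2 * A * (1 - A)), where A is the mean per-problem accuracy, so a model with high consistency has little of its error variance explained by paraphrasing.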