Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate measured model performance. To address this, we propose a method for automatically creating a challenging test set without relying on the manual construction of artificial, unrealistic examples. We categorize the test sets of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces measures of spurious correlation: on the examples labeled most difficult, model performance drops markedly, and these examples cover more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained on only a fraction of the data match the performance of models trained on the full dataset, surpassing other dataset characterization techniques. Our work addresses limitations in NLI dataset construction and provides a more faithful evaluation of model performance, with implications for diverse NLU applications.
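The abstract does not specify the exact training-dynamics signal used for the three-way split; below is a minimal sketch of one common signal in this family, assuming a dataset-cartography-style confidence score (the mean predicted probability of the gold label across training epochs), with low-confidence examples treated as hardest. The function and variable names (`difficulty_levels`, `gold_probs`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def difficulty_levels(gold_probs: np.ndarray, n_levels: int = 3) -> np.ndarray:
    """Assign each example a difficulty level from training dynamics.

    gold_probs: shape (n_epochs, n_examples); entry [e, i] is the model's
    predicted probability of example i's gold label after epoch e.
    Returns an int array: 0 = hardest level, n_levels - 1 = easiest.
    """
    confidence = gold_probs.mean(axis=0)           # mean gold-label probability per example
    order = np.argsort(confidence)                 # lowest confidence (hardest) first
    levels = np.empty(confidence.shape[0], dtype=int)
    for level, chunk in enumerate(np.array_split(order, n_levels)):
        levels[chunk] = level                      # equal-size difficulty buckets
    return levels

# Toy usage: 5 epochs, 8 examples of simulated gold-label probabilities.
rng = np.random.default_rng(0)
print(difficulty_levels(rng.uniform(size=(5, 8))))
```

Splitting by rank into equal-size buckets keeps the three difficulty levels balanced; fixed confidence thresholds are a plausible alternative when comparable bucket sizes across datasets matter less.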