NLP models have progressed drastically in recent years, as measured by the numerous datasets proposed to evaluate their performance. Questions remain, however, about how particular dataset design choices may affect the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across four datasets, split according to eight compositional splitting strategies, and rank models on 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) datasets generated by humans align better with each other than they do with synthetic datasets, or than synthetic datasets do among themselves; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) the choice of lexical items used in the data can strongly impact conclusions. Overall, our results demonstrate that much work remains to be done in assessing whether popular evaluation datasets measure what they intend to measure, and suggest that establishing more rigorous standards for the validity of evaluation sets could benefit the field.
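To make "rank modeling approaches differently" concrete, the sketch below compares the model orderings induced by two splits using Kendall's tau (the tau-a variant). The abstract does not say which agreement measure is used, so the choice of Kendall's tau is an assumption made for illustration. The function name `rank_agreement`, the model names, and the scores are all hypothetical and are not results from this work.

```python
from itertools import combinations


def rank_agreement(scores_a, scores_b):
    """Kendall's tau-a between the model orderings induced by two splits.

    Each argument maps a model name to its score on one split.
    Returns a value in [-1, 1]: 1 means identical rankings, -1 means
    fully reversed rankings. Tied pairs count as neither concordant
    nor discordant.
    """
    # Only compare models that were evaluated on both splits.
    models = sorted(set(scores_a) & set(scores_b))
    concordant = 0
    discordant = 0
    for m1, m2 in combinations(models, 2):
        diff_a = scores_a[m1] - scores_a[m2]
        diff_b = scores_b[m1] - scores_b[m2]
        product = diff_a * diff_b
        if product > 0:
            concordant += 1  # both splits order this pair the same way
        elif product < 0:
            discordant += 1  # the splits disagree on this pair
    n_pairs = len(models) * (len(models) - 1) // 2
    if n_pairs == 0:
        return 0.0
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for three placeholder models on two splits.
split_x = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}
split_y = {"model_a": 0.6, "model_b": 0.8, "model_c": 0.4}

print(rank_agreement(split_x, split_y))
```

With these made-up scores, the two splits agree on two of the three model pairs and disagree on one, so the agreement is (2 - 1) / 3. Computing this for every pair of splits would give a matrix showing which splits lead to similar conclusions about the models.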