In recent years, the availability of large-scale annotated datasets such as the Stanford Natural Language Inference (SNLI) and Multi-Genre Natural Language Inference (MultiNLI) corpora, coupled with the advent of pre-trained language models, has significantly advanced natural language inference (NLI). However, these crowdsourced datasets often contain biases, or dataset artifacts, which lead to overestimated model performance and poor generalization. In this work, we investigate dataset artifacts and develop strategies to address them. Using a novel statistical testing procedure, we find a significant association between vocabulary distribution and textual entailment classes, which identifies vocabulary as a notable source of bias. To mitigate these issues, we propose several automatic data augmentation strategies that operate from the character level to the word level. By fine-tuning the ELECTRA pre-trained language model, we compare models trained with augmented data against their baseline counterparts. The experiments show that the proposed approaches improve model accuracy by up to 0.66% and reduce bias by up to 1.14%.
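The abstract does not specify the statistical testing procedure, so the following is only a minimal sketch of the general idea of testing whether a word's presence is associated with the entailment label. It uses a standard Pearson chi-square test of independence as a stand-in, not the authors' procedure. The function name `word_label_chi2`, the whitespace tokenization, and the toy examples are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# The three NLI classes used by SNLI and MultiNLI.
LABELS = ("entailment", "neutral", "contradiction")


def word_label_chi2(examples, word):
    """Pearson chi-square test of independence between the presence of
    `word` in a hypothesis and its label (illustrative stand-in only).

    examples: iterable of (hypothesis_text, label) pairs.
    Returns (statistic, p_value). The p-value assumes 2 degrees of
    freedom, i.e. that all three labels occur in `examples`.
    """
    present, absent = Counter(), Counter()
    for text, label in examples:
        tokens = set(text.lower().split())  # naive whitespace tokenizer (assumption)
        (present if word in tokens else absent)[label] += 1

    # 2 x 3 contingency table: rows = word present / absent, columns = labels.
    table = [[present[l] for l in LABELS], [absent[l] for l in LABELS]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("no examples provided")

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected

    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Toy, made-up hypotheses; far too small for a reliable test.
toy_examples = [
    ("nobody is outside", "contradiction"),
    ("nobody is sleeping", "contradiction"),
    ("a man is outside", "entailment"),
    ("a person is sleeping", "entailment"),
    ("a man is tall", "neutral"),
    ("a woman is happy", "neutral"),
]

stat, p_value = word_label_chi2(toy_examples, "nobody")
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value for a word such as "nobody" would suggest that the word alone carries label information, which is the kind of vocabulary artifact the abstract describes. On a real corpus the test would be repeated over many words, so a multiple-comparison correction would likely be needed.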