Natural Language Inference (NLI) tasks require identifying the relationship between sentence pairs, typically classified as entailment, contradiction, or neutral. While the current state-of-the-art (SOTA) model, Entailment Few-Shot Learning (EFL), achieves 93.1% accuracy on the Stanford Natural Language Inference (SNLI) dataset, further advances are constrained by the dataset's limitations. To address this, we propose a novel approach that leverages synthetic data augmentation to increase dataset diversity and complexity. We present UnitedSynT5, an extension of EFL that uses a T5-based generator to synthesize additional premise-hypothesis pairs, which are rigorously cleaned and integrated into the training data. These augmented examples are processed within the EFL framework, embedding labels directly into the hypotheses for consistency. We train a GTR-T5-XL model on this expanded dataset, achieving a new benchmark of 94.7% accuracy on SNLI, 94.0% on E-SNLI, and 92.6% on MultiNLI, surpassing the previous SOTA models. This research demonstrates the potential of synthetic data augmentation for improving NLI models, offering a path toward further advances in natural language understanding tasks.
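The EFL-style label embedding mentioned above can be illustrated with a minimal sketch: each (premise, hypothesis, label) triple is recast as binary entailment decisions by folding a label description into the hypothesis text. The template wording and function names here are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical sketch of EFL-style reformulation: the three-way NLI
# label is embedded into the hypothesis via a verbal template, turning
# classification into binary entail / not-entail decisions.
# Templates below are assumptions for illustration only.

LABEL_TEMPLATES = {
    "entailment": "This implies that {h}",
    "contradiction": "This contradicts the claim that {h}",
    "neutral": "This is undetermined with respect to whether {h}",
}

def embed_label(premise: str, hypothesis: str, gold_label: str):
    """Return (input_text, binary_target) pairs, one per candidate label.

    Exactly one pair (the gold label's) carries target 1.
    """
    examples = []
    for candidate, template in LABEL_TEMPLATES.items():
        text = premise + " " + template.format(h=hypothesis)
        examples.append((text, 1 if candidate == gold_label else 0))
    return examples

pairs = embed_label("A man plays guitar.", "a man is making music.", "entailment")
```

Under this reformulation, the synthetic premise-hypothesis pairs produced by the T5 generator can be dropped into the same binary training pipeline without a separate label head.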