Natural Language Inference (NLI) tasks require identifying the relationship between sentence pairs, typically classified as entailment, contradiction, or neutral. While the current state-of-the-art (SOTA) model, Entailment Few-Shot Learning (EFL), achieves 93.1% accuracy on the Stanford Natural Language Inference (SNLI) dataset, further advancements are constrained by the dataset's limitations. To address this, we propose a novel approach that uses synthetic data augmentation to enhance dataset diversity and complexity. We present UnitedSynT5, an extension of EFL that employs a T5-based generator to synthesize additional premise-hypothesis pairs, which are rigorously cleaned and integrated into the training data. These augmented examples are processed within the EFL framework, with labels embedded directly into the hypotheses for consistency. We train a GTR-T5-XL model on this expanded dataset, achieving new benchmarks of 94.7% accuracy on SNLI, 94.01% on E-SNLI, and 92.57% on MultiNLI, surpassing the previous SOTA models. This work demonstrates the potential of synthetic data augmentation for improving NLI models and offers a path toward further advances in natural language understanding.
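The label-embedding step described above can be sketched as a small preprocessing function that folds each NLI label into its hypothesis before training. The template wording below is an assumption for illustration, not necessarily the exact format used by EFL or UnitedSynT5.

```python
# Sketch of embedding NLI labels directly into hypotheses, in the spirit
# of the EFL framework. Template phrasing is illustrative (an assumption),
# not the paper's verbatim format.

LABEL_TEMPLATES = {
    "entailment":    "{hypothesis} This statement is true.",
    "contradiction": "{hypothesis} This statement is false.",
    "neutral":       "{hypothesis} This statement may or may not be true.",
}

def embed_label(premise: str, hypothesis: str, label: str) -> dict:
    """Return a training example with the label folded into the hypothesis."""
    if label not in LABEL_TEMPLATES:
        raise ValueError(f"unknown label: {label}")
    return {
        "premise": premise,
        "hypothesis": LABEL_TEMPLATES[label].format(hypothesis=hypothesis),
    }

example = embed_label(
    "A man is playing a guitar on stage.",
    "A musician is performing.",
    "entailment",
)
print(example["hypothesis"])
```

In this formulation, synthetic premise-hypothesis pairs produced by the generator can pass through the same function, keeping augmented and original examples in a consistent format for the downstream GTR-T5-XL training run.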