We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model. UNA uses TF-IDF scores to estimate the perceived importance of each term in a sentence and then produces negative samples by replacing terms according to those scores. Our experiments demonstrate that models trained with UNA achieve better overall performance on semantic textual similarity tasks. Combining UNA with paraphrasing augmentation yields additional gains. Further results show that our method is compatible with different backbone models, and ablation studies support the use of TF-IDF-driven control over negative augmentation.
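The idea of TF-IDF-guided term replacement can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how terms are selected or what they are replaced with, so the sketch assumes that the highest-scoring terms are swapped for random vocabulary terms. The function names `tfidf_scores` and `make_negative`, the `n_replace` parameter, and the smoothed IDF formula are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def tfidf_scores(sentence, corpus):
    """Return one TF-IDF score per token of `sentence`, using `corpus` for IDF."""
    n_docs = len(corpus)
    # Document frequency: number of corpus sentences that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(sentence)
    scores = []
    for term in sentence:
        # Smoothed IDF (an assumption; other IDF variants would work too).
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
        scores.append(tf[term] / len(sentence) * idf)
    return scores


def make_negative(sentence, corpus, n_replace=1, rng=None):
    """Replace the `n_replace` highest-scoring tokens with other vocabulary terms.

    Assumption: replacing the most important terms changes the meaning enough
    for the result to serve as a hard negative for the original sentence.
    """
    rng = rng or random.Random(0)
    scores = tfidf_scores(sentence, corpus)
    # Token positions ordered from highest to lowest TF-IDF score.
    ranked = sorted(range(len(sentence)), key=lambda i: scores[i], reverse=True)
    vocab = sorted({term for doc in corpus for term in doc})
    negative = list(sentence)
    for i in ranked[:n_replace]:
        candidates = [term for term in vocab if term != sentence[i]]
        if candidates:
            negative[i] = rng.choice(candidates)
    return negative


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
negative = make_negative(["the", "cat", "sat"], corpus)
print(negative)
```

In this toy corpus, "cat" appears in only one sentence and therefore receives the highest score, so it is the term that gets replaced while "the" and "sat" are left in place. A real setup would compute the statistics over the full training corpus and would likely use a more careful replacement policy than uniform sampling.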