Contrastive learning has recently achieved compelling performance in unsupervised sentence representation learning. However, data augmentation protocols, an essential ingredient, have not been well explored. The pioneering work SimCSE, which resorts to a simple dropout mechanism (viewed as continuous augmentation), surprisingly outperforms discrete augmentations such as cropping, word deletion, and synonym replacement. To understand the underlying rationale, we revisit existing approaches and hypothesize the desiderata of a reasonable data augmentation method: a balance between semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noise at the lexical level to produce diverse sentence forms. Furthermore, standard negation is exploited to generate negative samples, alleviating the feature suppression involved in contrastive learning. We conduct extensive experiments on semantic textual similarity across diverse datasets. The results consistently support the superiority of the proposed methods. Our key code is available at https://github.com/Zhudongsheng75/SDA
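The discrete augmentations named above can be illustrated with a minimal sketch. The exact rules used by the paper's method may differ; the templates, function names, and word lists below are illustrative assumptions only, meant to convey the idea of minimal lexical noise that preserves meaning, plus a negation-based negative sample.

```python
import random

# Hypothetical word lists; the paper's actual inventories may differ.
PUNCTUATION = [",", ".", ";", ":", "!", "?"]

def punctuation_insertion(sentence: str, rng: random.Random) -> str:
    """Positive sample: insert a random punctuation mark between two tokens
    as minimal lexical noise that leaves the semantics intact."""
    tokens = sentence.split()
    pos = rng.randint(1, len(tokens) - 1)
    tokens.insert(pos, rng.choice(PUNCTUATION))
    return " ".join(tokens)

def double_negation(sentence: str) -> str:
    """Positive sample: a crude double-negation template that cancels out,
    diversifying the surface form while keeping the meaning."""
    return f"it is not untrue that {sentence}"

def standard_negation(sentence: str) -> str:
    """Negative sample: a single negation flips the meaning, giving a hard
    negative intended to alleviate feature suppression."""
    return f"it is not true that {sentence}"
```

A modal-verb scheme would follow the same pattern (inserting a hedge such as "may" or "can" at a grammatically safe position), but placing modals correctly requires light syntactic analysis, so it is omitted from this sketch.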