Vector representations of natural language are ubiquitous in search applications. Recently, various methods based on contrastive learning have been proposed to learn textual representations from unlabelled data by maximizing alignment between minimally perturbed embeddings of the same text and encouraging a uniform distribution of embeddings across a broader corpus. In contrast, we propose maximizing alignment between texts and a composition of their phrasal constituents. We consider several realizations of this objective and elaborate on the impact on representations in each case. Experimental results on semantic textual similarity tasks show improvements over baselines, with results comparable to state-of-the-art approaches. Moreover, this work is the first to achieve such results without incurring the cost of auxiliary training objectives or additional network parameters.
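To make the proposed objective concrete, the following is a minimal sketch of one possible realization, not the method as implemented in this work. It assumes that the composition is an element-wise mean of constituent embeddings and that alignment is trained with an InfoNCE-style loss using in-batch negatives; the abstract specifies neither choice. The names `compose`, `composition_contrastive_loss`, and `temperature` are hypothetical, and the embeddings are toy values rather than encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def compose(constituents):
    """Assumed composition: element-wise mean of the phrase embeddings."""
    dim = len(constituents[0])
    count = len(constituents)
    return [sum(c[i] for c in constituents) / count for i in range(dim)]


def composition_contrastive_loss(text_embs, constituent_embs, temperature=0.05):
    """InfoNCE-style loss (an assumption, not taken from the abstract).

    The positive for each text is the composition of its own constituents;
    the compositions belonging to the other texts act as in-batch negatives.
    """
    composed = [compose(c) for c in constituent_embs]
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [cosine(text, c) / temperature for c in composed]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(text_embs)


# Toy example: two texts, each with two phrase embeddings.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
constituent_embs = [
    [[1.0, 0.2], [1.0, -0.2]],   # phrases of text 0
    [[0.2, 1.0], [-0.2, 1.0]],   # phrases of text 1
]

aligned = composition_contrastive_loss(text_embs, constituent_embs)
swapped = composition_contrastive_loss(text_embs, constituent_embs[::-1])
print(f"aligned loss: {aligned:.4f}, swapped loss: {swapped:.4f}")
```

In this toy setting the loss is low when each text is paired with its own constituents and high when the pairings are swapped, which is the behaviour an alignment objective of this kind is meant to produce. In the same sketch, the in-batch negatives stand in for the uniformity pressure described above.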