Dense vector representations of sentences have made significant progress in recent years, as can be seen on sentence similarity tasks. Real-world phrase retrieval applications, on the other hand, still face challenges in using dense representations effectively. We show that when target phrases reside inside noisy context, representing the full sentence with a single dense vector is not sufficient for effective phrase retrieval. We therefore explore representing multiple consecutive sub-sentence word spans, each with its own dense vector. We show that this technique is much more effective for phrase mining, yet it requires considerable compute to obtain useful span representations. Accordingly, we argue for contextualized word/token embeddings that can be aggregated over arbitrary word spans while preserving each span's semantic meaning. We introduce a modification to the common contrastive loss used for sentence embeddings that encourages word embeddings to have this property. To demonstrate the effect of this method, we present a dataset based on STS-B with additional generated text; the task requires finding the best-matching paraphrase residing in a larger context and reporting its degree of similarity to the original phrase. We demonstrate on this dataset that our proposed method achieves better results without a significant increase in compute.
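The idea of aggregating token embeddings over arbitrary word spans can be illustrated with a minimal sketch. It assumes mean pooling as the aggregation function (the abstract does not say which aggregation is used) and scores each span against a query vector by cosine similarity. The function names, the `max_len` parameter, and the use of plain Python lists in place of encoder outputs are all hypothetical choices made for illustration. Prefix sums let every span vector be derived from a single pass over the token embeddings, which is one way span search might avoid re-encoding each span.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prefix_sums(token_vecs):
    """Running sums of token vectors; sums[i] is the sum of the first i tokens."""
    dim = len(token_vecs[0])
    sums = [[0.0] * dim]
    for vec in token_vecs:
        sums.append([s + x for s, x in zip(sums[-1], vec)])
    return sums


def best_span(query_vec, token_vecs, max_len=4):
    """Return (score, start, end) for the best-matching span [start, end).

    Assumption: a span is represented by the mean of its token embeddings.
    `max_len` (hypothetical) caps the span length in tokens.
    """
    sums = prefix_sums(token_vecs)
    n = len(token_vecs)
    best = (-2.0, 0, 0)  # cosine is always >= -1, so any span beats this
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            length = end - start
            # Mean-pooled span vector from two prefix sums.
            span_vec = [(b - a) / length for a, b in zip(sums[start], sums[end])]
            score = cosine(query_vec, span_vec)
            if score > best[0]:
                best = (score, start, end)
    return best
```

In practice `token_vecs` would be the contextualized token embeddings produced by one encoder pass over the noisy context, and `query_vec` the embedding of the query phrase. Whether mean-pooled spans keep their semantic meaning depends on how the token embeddings were trained, which is the property the modified contrastive loss is meant to encourage.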