Span-Aggregatable, Contextualized Word Embeddings for Effective Phrase Mining

Dense vector representations of sentences have made significant progress in recent years, as can be seen on sentence similarity tasks. Real-world phrase retrieval applications, on the other hand, still encounter challenges in using dense representations effectively. We show that when target phrases reside inside noisy context, representing the full sentence with a single dense vector is not sufficient for effective phrase retrieval. We therefore look into representing multiple sub-sentence spans of consecutive words, each with its own dense vector. We show that this technique is much more effective for phrase mining, yet requires considerable compute to obtain useful span representations. Accordingly, we argue for contextualized word/token embeddings that can be aggregated over arbitrary word spans while preserving the span's semantic meaning. We introduce a modification to the common contrastive loss used for sentence embeddings that encourages word embeddings to have this property. To demonstrate the effect of this method, we present a dataset, based on the STS-B dataset with additional generated text, that requires finding the best-matching paraphrase residing in a larger context and reporting its degree of similarity to the original phrase. We demonstrate on this dataset how our proposed method can achieve better results without a significant increase in compute.
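To make the idea of span-aggregatable token embeddings concrete, here is a minimal sketch of phrase retrieval over word spans. It rests on several assumptions not stated in the abstract: aggregation is taken to be mean pooling, similarity is cosine, and the token vectors are random stand-ins rather than the output of a trained encoder. The function names `span_embeddings` and `best_span` are hypothetical.

```python
import numpy as np

def span_embeddings(token_vecs, max_len):
    """Mean-pool token vectors for every span of 1..max_len consecutive tokens.

    Assumption: mean pooling is the aggregation; the abstract does not specify it.
    """
    n, d = token_vecs.shape
    # prefix[i] is the sum of the first i token vectors, so each span sum
    # costs one subtraction instead of a fresh pass over the tokens.
    prefix = np.vstack([np.zeros((1, d)), np.cumsum(token_vecs, axis=0)])
    spans, vecs = [], []
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            spans.append((start, end))  # half-open interval [start, end)
            vecs.append((prefix[end] - prefix[start]) / (end - start))
    return spans, np.array(vecs)

def best_span(query_vec, token_vecs, max_len):
    """Return the span whose pooled vector has the highest cosine similarity to the query."""
    spans, vecs = span_embeddings(token_vecs, max_len)
    q = query_vec / np.linalg.norm(query_vec)
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    scores = v @ q
    i = int(np.argmax(scores))
    return spans[i], float(scores[i])

# Toy data: random vectors standing in for the contextual token embeddings
# of a 12-token sentence (not produced by any real model).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 16))
query = tokens[4:7].mean(axis=0)  # a "phrase" hidden inside the sentence

span, score = best_span(query, tokens, max_len=5)
print(span, round(score, 6))
```

The encoder would run once per sentence, and every candidate span is then derived from the same token vectors by cheap arithmetic. This is the compute saving that motivates aggregatable embeddings, as opposed to re-encoding each span separately.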


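Related topic: word vector representations

Distributed representations encode language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows tend to yield embeddings that reflect topical information, while smaller windows yield embeddings that reflect a word's function and contextual semantics.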
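The analogy relation man − woman ≈ king − queen can be illustrated with plain vector arithmetic. The sketch below uses hand-built three-dimensional toy vectors, an assumption made purely for illustration: they are not trained embeddings, and the `analogy` helper is hypothetical.

```python
import numpy as np

# Hand-built toy vectors with axes (royalty, gender, fruit).
# These are illustrative assumptions, not embeddings learned from a corpus.
vocab = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve 'a is to b as ? is to c' by finding the word nearest to a - b + c."""
    target = vocab[a] - vocab[b] + vocab[c]
    # Exclude the three input words, as is conventional for analogy queries.
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# The two difference vectors coincide in this toy space.
print(vocab["man"] - vocab["woman"], vocab["king"] - vocab["queen"])
print(analogy("king", "man", "woman"))
```

With real embeddings the two difference vectors are only approximately equal, which is why the nearest neighbour is searched for instead of an exact match.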
