Learning Word Embedding with Better Distance Weighting and Window Size Scheduling

Distributed word representation (a.k.a. word embedding) is a key focus in natural language processing (NLP). As a highly successful word embedding model, Word2Vec offers an efficient method for learning distributed word representations on large datasets. However, Word2Vec does not account for the distance between center and context words. We propose two novel methods, Learnable Formulated Weights (LFW) and Epoch-based Dynamic Window Size (EDWS), to incorporate distance information into two variants of Word2Vec, the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram (Skip-gram) model. For CBOW, LFW computes distance-related weights for average pooling from a formula with learnable parameters, chosen to best reflect how a word's influence varies with distance; this also offers insights for future NLP text modeling research. For Skip-gram, EDWS improves the dynamic window size strategy to introduce distance information in a more balanced way. Experiments demonstrate that LFW and EDWS enhance Word2Vec's performance, surpassing previous state-of-the-art methods.
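To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. The abstract does not give the actual LFW formula or the EDWS schedule, so both `distance_weights` (an inverse power law with one parameter `alpha`) and `epoch_window` (a linear growth schedule) are hypothetical stand-ins chosen for illustration, not the authors' method. The only part taken from standard Word2Vec is `sample_window`, which mirrors the original implementation's habit of drawing the effective window size uniformly between 1 and the maximum for each center word.

```python
import random


def distance_weights(distances, alpha):
    """Hypothetical weight formula w(d) = d ** (-alpha), normalized to sum to 1.

    alpha stands in for a learnable parameter; alpha = 0 recovers the
    unweighted average used by plain CBOW. This is NOT the paper's formula.
    """
    raw = [d ** (-alpha) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_context_vector(context_vectors, distances, alpha):
    """Distance-weighted average of context embeddings (the CBOW pooling step)."""
    weights = distance_weights(distances, alpha)
    dim = len(context_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, context_vectors))
        for i in range(dim)
    ]


def sample_window(max_window, rng):
    """Standard Word2Vec behavior: effective window drawn uniformly from 1..max_window."""
    return rng.randint(1, max_window)


def epoch_window(epoch, num_epochs, max_window):
    """Hypothetical epoch-based schedule: window grows linearly from 1 to max_window.

    Epochs are zero-indexed. This is an assumed schedule, not the EDWS rule.
    """
    if num_epochs == 1:
        return max_window
    return 1 + round((max_window - 1) * epoch / (num_epochs - 1))
```

With `alpha = 0` every context word gets the same weight, so the sketch reduces to ordinary average pooling; larger `alpha` shifts weight toward nearby words. Uniform window sampling already favors nearby words implicitly, because a word at distance 1 falls inside every sampled window while a word at the maximum distance is included only when the largest window is drawn.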

Word vector representation

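A distributed representation expresses language as dense, low-dimensional, continuous vectors. Researchers first noticed that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows generally yield embeddings that reflect topical information, while smaller windows yield embeddings that better reflect a word's function and contextual semantics.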
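The analogy relation can be checked with simple vector arithmetic: if man − woman ≈ king − queen, then woman − man + king should land near queen. The sketch below illustrates this with hand-made three-dimensional vectors; the `toy` embeddings and the `analogy` helper are assumptions made up for this example, not trained vectors or any library's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(embeddings, a, b, c):
    """Return the word closest to (b - a + c), excluding the three query words."""
    target = [
        y - x + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hand-made toy vectors (illustrative only): axis 1 encodes gender,
# axis 2 encodes royalty, and "apple" is an unrelated distractor.
toy = {
    "man": [1.0, 0.0, 1.0],
    "woman": [1.0, 1.0, 1.0],
    "king": [1.0, 0.0, 3.0],
    "queen": [1.0, 1.0, 3.0],
    "apple": [3.0, 0.0, 0.0],
}

print(analogy(toy, "man", "woman", "king"))  # prints "queen"
```

Real embeddings satisfy such relations only approximately, which is why the nearest neighbor by cosine similarity is used rather than an exact match.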
