Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk sampling

Word embedding has become ubiquitous and is widely used in text mining and natural language processing (NLP) tasks such as information retrieval, semantic analysis, and machine translation. Unfortunately, training word embeddings on a relatively large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph by random walks, and finally trains the word embedding on the sampled corpus. We posit that because of the stable vocabulary, relatively fixed idioms, and fixed expressions in English, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. As a result, Word-Graph2vec has a stable runtime on large-scale datasets, and its performance advantage becomes increasingly pronounced as the training corpus grows. Extensive experiments conducted on real-world datasets show that the proposed algorithm is four to five times more efficient than traditional Skip-Gram, while the error introduced by the random walk sampling is small.
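The pipeline described in the abstract (corpus to co-occurrence graph, then random-walk sampling) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size, the walk length, the number of walks per node, and the choice to weight transitions by raw co-occurrence counts are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Build an undirected weighted graph {word: {neighbor: count}}.

    Two words are linked when they appear within `window` tokens of each
    other; the edge weight counts how often that happens (an assumption).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                if other == word:
                    continue  # skip self-loops
                graph[word][other] += 1
                graph[other][word] += 1
    return {word: dict(nbrs) for word, nbrs in graph.items()}


def random_walk(graph, start, length, rng):
    """Walk `length` nodes, picking each next word in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        nbrs = graph.get(walk[-1])
        if not nbrs:
            break  # isolated node: stop early
        words = list(nbrs)
        weights = [nbrs[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


def sample_corpus(graph, walks_per_node=2, walk_length=5, seed=0):
    """Start `walks_per_node` walks from every word; each walk is one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for start in sorted(graph):
            corpus.append(random_walk(graph, start, walk_length, rng))
    return corpus


# Toy corpus, hypothetical and only for illustration.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
graph = build_cooccurrence_graph(sentences, window=2)
sampled = sample_corpus(graph)
```

The resulting `sampled` list of token sequences could then be passed to any Skip-Gram trainer in place of the original corpus. Because the graph size depends on the vocabulary rather than on the number of tokens, the sampling cost in this sketch does not grow with corpus length once the graph is built, which is the intuition behind the stable runtime claimed in the abstract.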

Word vector representation

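A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Word embedding methods can be trained directly on large-scale unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: large context windows generally yield embeddings that reflect topical information, while small context windows yield embeddings that reflect a word's function and contextual semantics.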
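The analogy relation man − woman ≈ king − queen can be checked with simple vector arithmetic: king − man + woman should land closest to queen. The sketch below uses hand-picked three-dimensional toy vectors (hypothetical axes: royalty, gender, fruit), not learned embeddings, purely to show the mechanics.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
# Assumed axes for illustration: [royalty, gender, fruit].
vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


best = analogy("king", "man", "woman", vectors)
```

With these toy vectors, `best` is `"queen"`. With real embeddings the match is approximate rather than exact, which is why the nearest neighbor by cosine similarity is used.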
