Word embeddings have become ubiquitous in text mining and natural language processing (NLP) tasks such as information retrieval, semantic analysis, and machine translation. Unfortunately, training word embeddings on a large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph by random walks, and then trains the word embedding on the sampled corpus. We posit that, because of the stable vocabulary, idioms, and fixed expressions of English, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. As a result, Word-Graph2vec has a stable runtime on large-scale datasets, and its performance advantage becomes increasingly pronounced as the training corpus grows. Extensive experiments on real-world datasets show that the proposed algorithm is four to five times more efficient than traditional Skip-Gram, while the error introduced by random walk sampling is small.
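The pipeline described above (corpus to co-occurrence graph, then random-walk sampling) can be illustrated with a minimal sketch. The abstract does not specify how edges are weighted or how walks are drawn, so the choices here are assumptions: edge weights are co-occurrence counts within a fixed window, and each step picks a neighbor with probability proportional to edge weight. The function names `build_cooccurrence_graph` and `sample_walks` and their parameters are hypothetical and are not taken from the authors' implementation.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Return {word: {neighbor: count}} for words co-occurring within `window` tokens.

    Assumption of this sketch: the graph is undirected and weighted by raw counts.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Pair each token with the next `window` tokens in the same sentence.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                other = tokens[j]
                if other == word:
                    continue  # skip self-loops (an assumption, not from the paper)
                graph[word][other] += 1
                graph[other][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


def sample_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Draw weighted random walks starting from every node of the graph."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                nbrs = graph[walk[-1]]
                if not nbrs:
                    break  # dead end; cannot occur in an undirected graph
                words = list(nbrs)
                weights = [nbrs[w] for w in words]
                # Next word is chosen proportionally to the co-occurrence count.
                walk.append(rng.choices(words, weights=weights, k=1)[0])
            walks.append(walk)
    return walks
```

In this sketch the number of sampled sequences depends on the number of graph nodes rather than on the number of corpus tokens, which is one way to read the abstract's claim of stable runtime as the corpus grows. The resulting walks could then be passed as training sentences to any Skip-Gram trainer.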