We present Local Graph-based Dictionary Expansion (LGDE), a method for data-driven discovery of the semantic neighbourhood of words using tools from manifold learning and network science. At the heart of LGDE lies the creation of a word similarity graph from the geometry of word embeddings followed by local community detection based on graph diffusion. The diffusion in the local graph manifold allows the exploration of the complex nonlinear geometry of word embeddings to capture word similarities based on paths of semantic association, over and above direct pairwise similarities. Exploiting such semantic neighbourhoods enables the expansion of dictionaries of pre-selected keywords, an important step for tasks in information retrieval, such as database queries and online data collection. We validate LGDE on two user-generated English-language corpora and show that LGDE enriches the list of keywords with improved performance relative to methods based on direct word similarities or co-occurrences. We further demonstrate our method through a real-world use case from communication science, where LGDE is evaluated quantitatively on the expansion of a conspiracy-related dictionary from online data collected and analysed by domain experts. Our empirical results and expert user assessment indicate that LGDE expands the seed dictionary with more useful keywords due to the manifold-learning-based similarity network.
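The two stages described above, building a word similarity graph from embedding geometry and then diffusing outward from seed keywords, can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain symmetric k-nearest-neighbour graph with cosine weights and a random walk with restart as a stand-in for the graph construction and diffusion-based local community detection that LGDE uses. The function names, the parameters (`k`, `restart`, `steps`, `n_new`), and the toy 2-D embeddings are all hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_similarity_graph(emb, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity.

    Assumption: a plain kNN graph stands in for the paper's graph construction.
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a word is never its own neighbour
    n = sim.shape[0]
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k most similar words per row
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    adj = np.zeros((n, n))
    adj[rows, cols] = np.clip(sim[rows, cols], 0.0, None)  # no negative weights
    return np.maximum(adj, adj.T)  # keep an edge if either endpoint chose it


def diffuse_from_seeds(adj, seeds, restart=0.3, steps=20):
    """Random walk with restart, started uniformly on the seed nodes.

    Assumption: this simple diffusion replaces the paper's local community
    detection; mass reaches words via multi-step paths, not only direct edges.
    """
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # isolated nodes keep an all-zero row
    transition = adj / deg  # row-stochastic transition matrix
    start = np.zeros(adj.shape[0])
    start[seeds] = 1.0 / len(seeds)
    p = start.copy()
    for _ in range(steps):
        p = restart * start + (1.0 - restart) * (p @ transition)
    return p


def expand_dictionary(vocab, emb, seed_words, k=2, n_new=2):
    """Return the n_new non-seed words with the highest diffusion score."""
    index = {w: i for i, w in enumerate(vocab)}
    seeds = [index[w] for w in seed_words]
    scores = diffuse_from_seeds(knn_similarity_graph(emb, k), seeds)
    scores[seeds] = -np.inf  # never re-propose the seeds themselves
    ranked = np.argsort(-scores)
    return [vocab[i] for i in ranked[:n_new]]


# Toy 2-D embeddings (hypothetical): two loose semantic clusters.
vocab = ["hoax", "plot", "coverup", "recipe", "oven", "flour"]
emb = np.array([
    [1.00, 0.00],
    [0.95, 0.30],
    [0.80, 0.60],
    [0.00, 1.00],
    [0.30, 0.95],
    [0.10, 1.00],
])
expanded = expand_dictionary(vocab, emb, seed_words=["hoax"])
print(expanded)
```

In this toy setting the seed "hoax" should pull in the two words from its own cluster, while the cooking terms receive only the small amount of mass that leaks across the single cross-cluster edge. A real application would replace the toy vectors with embeddings learned from the target corpus and would need a principled rule for where to cut the ranked list.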