Search behaviour is characterised by synonymy and polysemy, as users often want to search for information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex use of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high-dimensional spaces. Information Retrieval (IR) systems can leverage this to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query–statement similarity that moves away from the common similarity measure based on centroids of neural word embeddings. Motivated by the Word Mover's Distance (WMD) model, similarity is evaluated using the distances between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy when similarity ranking through WMD is combined with word embedding techniques. The top-performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models evaluated, including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, results in domain-agnostic language processing solutions that are portable to diverse business use-cases.
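The core idea above, ranking by distances between individual words rather than between document centroids, can be illustrated with a minimal sketch of the *relaxed* Word Mover's Distance (the cheap lower bound in which each query word simply travels to its nearest statement word, rather than solving the full optimal-transport problem). The toy two-dimensional vectors below are hypothetical stand-ins for real pre-trained GloVe embeddings, which are typically 50 to 300 dimensions.

```python
import numpy as np

# Hypothetical 2-D word vectors standing in for pre-trained GloVe
# embeddings; chosen so that semantically related words sit close together.
vectors = {
    "how":      np.array([0.10, 0.90]),
    "do":       np.array([0.20, 0.80]),
    "i":        np.array([0.30, 0.70]),
    "reset":    np.array([0.90, 0.10]),
    "change":   np.array([0.85, 0.20]),
    "my":       np.array([0.30, 0.60]),
    "password": np.array([0.95, 0.05]),
}

def relaxed_wmd(query, statement):
    """Relaxed WMD: each query word moves to its nearest statement word,
    and the per-word travel costs are averaged. Full WMD instead solves
    an optimal-transport problem over normalised word frequencies."""
    total = 0.0
    for q in query:
        total += min(np.linalg.norm(vectors[q] - vectors[s]) for s in statement)
    return total / len(query)

query     = ["reset", "my", "password"]
relevant  = ["how", "do", "i", "change", "my", "password"]
unrelated = ["how", "do", "i"]

# The relevant statement should sit closer to the query, even though it
# shares only two surface tokens with it ("my", "password"): "reset" is
# matched to the semantically nearby "change" rather than requiring an
# exact lexical overlap.
assert relaxed_wmd(query, relevant) < relaxed_wmd(query, unrelated)
```

Candidate statements can then be ranked in ascending order of this distance; the hedged point of the sketch is only the mechanism, not the exact scores, which depend entirely on the embeddings used.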