In this paper, we propose an alternative to deep neural networks for semantic information retrieval over long documents. The approach exploits clustering techniques to take the meaning of words into account in Information Retrieval systems targeting long as well as short documents. It uses a specially designed clustering algorithm to group words with similar meanings into clusters. The dual (lexical and semantic) representation of documents and queries is based on the vector space model proposed by Gerard Salton, applied in the vector space formed by these clusters. Our proposal is original at several levels: first, we propose an efficient algorithm that builds clusters of semantically close words from word embeddings; second, we define a formula for weighting these clusters; third, we propose a function that efficiently combines the meanings of words with a lexical model widely used in Information Retrieval. The evaluation of our proposal in three contexts on two different datasets, SQuAD and TREC-CAR, shows that it significantly improves on classical approaches based only on keywords, without degrading the lexical aspect.
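To make the pipeline described above concrete, the following is a minimal illustrative sketch and not the algorithm, weighting formula, or combination function of the paper, none of which are specified in this abstract. It assumes a greedy cosine-threshold clustering, raw term and cluster frequencies as weights, and a linear interpolation with a hypothetical parameter `alpha`; the names `cluster_words`, `combined_score`, and `threshold` are likewise introduced here for illustration only.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors given as sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors given as dict-like counts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_words(embeddings, threshold=0.8):
    """Assumed stand-in clustering: each word joins the first cluster whose
    seed vector is at least `threshold` similar, else it starts a new cluster."""
    seeds = []
    word_to_cluster = {}
    for word, vec in embeddings.items():
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                word_to_cluster[word] = cluster_id
                break
        else:
            word_to_cluster[word] = len(seeds)
            seeds.append(vec)
    return word_to_cluster


def combined_score(query, doc, word_to_cluster, alpha=0.5):
    """Assumed combination: linear mix of a lexical score (term space)
    and a semantic score (cluster space), both vector-space cosines."""
    lexical = sparse_cosine(Counter(query), Counter(doc))
    query_clusters = Counter(word_to_cluster[w] for w in query if w in word_to_cluster)
    doc_clusters = Counter(word_to_cluster[w] for w in doc if w in word_to_cluster)
    semantic = sparse_cosine(query_clusters, doc_clusters)
    return alpha * lexical + (1 - alpha) * semantic


# Toy two-dimensional embeddings, invented purely for illustration.
toy_embeddings = {
    "car": (1.0, 0.0),
    "automobile": (0.9, 0.1),
    "banana": (0.0, 1.0),
    "fruit": (0.1, 0.9),
}
toy_clusters = cluster_words(toy_embeddings)
```

With these toy vectors, a query containing only "car" shares no keyword with a document containing "automobile", so a purely lexical score is zero, while the cluster-space term still ranks that document above one about "banana" and "fruit". A document that matches the query term exactly keeps its full lexical score, which is the sense in which a combination of this kind need not degrade the lexical aspect.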