Computing document similarity is a critical task across NLP domains, with applications in deduplication, matching, and recommendation. Traditional approaches learn a representation of each document and apply a similarity or distance function over the resulting embeddings. However, representations learned for individual documents do not efficiently capture pairwise similarities and differences. Graph representations such as the Joint Concept Interaction Graph (JCIG) instead represent a pair of documents as a joint undirected weighted graph, which provides an interpretable representation of the pair. However, JCIGs are undirected and do not consider the sequential flow of sentences in documents. We propose two approaches that model document similarity by representing a document pair as a directed, sparse JCIG that incorporates sequential information. The two algorithms, inspired by Supergenome Sorting and Hamiltonian Path, replace the undirected edges with directed edges. Our approach also sparsifies the graph to $O(n)$ edges from JCIG's worst case of $O(n^2)$. We show that our sparse directed graph model architecture, consisting of a Siamese encoder and a GCN, achieves results comparable to the baseline on datasets that do not contain sequential information and beats the baseline by ten points on an instructional documents dataset that does contain sequential information.
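To make the sparsification claim concrete, the following is a minimal sketch of how a dense undirected graph over n concept nodes could be reduced to a directed path with n - 1 edges. It is not the Supergenome Sorting or Hamiltonian Path algorithm from the paper, whose details the abstract does not give. The function name `sparsify_to_directed_path`, the `positions` ordering signal (a mean sentence index per concept), and the toy weights are all assumptions made for illustration.

```python
def sparsify_to_directed_path(positions, weights):
    """Turn an undirected weighted graph into a directed path.

    positions: dict mapping node -> mean sentence index
               (hypothetical ordering signal, not from the paper)
    weights:   dict mapping frozenset({u, v}) -> undirected edge weight
    Returns a list of (source, target, weight) directed edges.
    """
    # Order nodes by where their sentences appear; break ties by node name.
    order = sorted(positions, key=lambda node: (positions[node], node))
    directed = []
    for src, dst in zip(order, order[1:]):
        # Reuse the undirected weight if the pair was connected, else 0.0.
        w = weights.get(frozenset((src, dst)), 0.0)
        directed.append((src, dst, w))
    return directed


# Toy example: 4 concept nodes, fully connected (6 undirected edges).
positions = {"A": 0.0, "B": 2.5, "C": 1.0, "D": 4.0}
weights = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.9,
    frozenset(("A", "D")): 0.1,
    frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.5,
    frozenset(("C", "D")): 0.3,
}
path_edges = sparsify_to_directed_path(positions, weights)
for edge in path_edges:
    print(edge)
```

In this toy case the six undirected edges of the complete graph collapse to three directed edges that visit every node once, following sentence order. This illustrates the $O(n^2)$ to $O(n)$ reduction described above under the stated assumptions.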