Query2GMM: Learning Representation with Gaussian Mixture Model for Reasoning over Knowledge Graphs

Logical query answering over Knowledge Graphs (KGs) is a fundamental yet complex task. A promising approach is to embed queries and entities jointly into the same embedding space. Research along this line suggests that a multi-modal distribution is better suited to representing answer entities than a uni-modal one, because a single query may have multiple disjoint answer subsets owing to the compositional nature of multi-hop queries and the varying latent semantics of relations. However, existing methods based on multi-modal distributions represent each subset only coarsely, without capturing its cardinality accurately, or even degenerate into uni-modal distribution learning during reasoning because they lack an effective similarity measure. To better model queries with diverse answers, we propose Query2GMM for answering logical queries over knowledge graphs. In Query2GMM, we present the GMM embedding, which represents each query using a univariate Gaussian Mixture Model (GMM). Each answer subset of a query is encoded by its cardinality, semantic center, and dispersion degree, allowing multiple subsets to be represented precisely. We then design a dedicated neural network for each operator to handle the inherent complexity of multi-modal distributions while alleviating cascading errors. Finally, we design a new similarity measure that assesses the relationships between an entity and a query's multiple answer subsets, enabling effective multi-modal distribution learning for reasoning. Comprehensive experimental results show that Query2GMM outperforms the best competitor by an absolute average of $6.35\%$.
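
To make the idea of encoding each answer subset by cardinality, semantic center, and dispersion degree concrete, the following minimal sketch evaluates a one-dimensional Gaussian mixture density. It is an illustration only, not the paper's model: it assumes that cardinalities are normalized into mixture weights, that the semantic center plays the role of a component mean, and that the dispersion degree plays the role of a standard deviation. The function names and the example numbers are hypothetical, and the density is a standard GMM density rather than the similarity measure proposed in Query2GMM.

```python
import math


def normalize_cardinalities(cardinalities):
    """Turn per-subset cardinalities into mixture weights that sum to 1 (assumed mapping)."""
    total = sum(cardinalities)
    return [c / total for c in cardinalities]


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at point x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_density(x, weights, means, stds):
    """Weighted sum of the component densities at point x."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


# Hypothetical query with two disjoint answer subsets (illustrative values only).
cardinalities = [30, 10]   # subset sizes
means = [-2.0, 3.0]        # semantic centers
stds = [0.5, 1.0]          # dispersion degrees
weights = normalize_cardinalities(cardinalities)

# A point near either center receives more density than a point between them.
density_near_first = gmm_density(-2.0, weights, means, stds)
density_near_second = gmm_density(3.0, weights, means, stds)
density_between = gmm_density(0.5, weights, means, stds)
```

In this toy setting, a uni-modal representation would have to place a single center somewhere between the two subsets, which is exactly the low-density region of the mixture.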
