Existing neural topic models (NTMs) with contrastive learning suffer from a sample bias problem caused by their word-frequency-based sampling strategy, which may produce false negative samples whose semantics are similar to those of the prototypes. In this paper, we explore an efficient sampling strategy and contrastive learning in NTMs to address this issue. We propose a new sampling assumption: negative samples should contain words that are semantically irrelevant to the prototype. Based on this assumption, we propose the graph contrastive topic model (GCTM), which conducts graph contrastive learning (GCL) using informative positive and negative samples generated by a graph-based sampling strategy that leverages in-depth correlation and irrelevance among documents and words. In GCTM, we first model the input document as a document-word bipartite graph (DWBG) and construct positive and negative word co-occurrence graphs (WCGs), encoded by graph neural networks, to express in-depth semantic correlation and irrelevance among words. Based on the DWBG and WCGs, we design the document-word information propagation (DWIP) process, which perturbs the edges of the DWBG according to multi-hop correlation and irrelevance among documents and words. This yields the desired positive and negative samples, which are used for GCL together with the prototypes to improve the learning of document topic representations and latent topics. We further show that GCL can be interpreted as a structured variational graph auto-encoder that maximizes the mutual information between latent topic representations of different perspectives on the DWBG. Experiments on several benchmark datasets demonstrate the effectiveness of our method in terms of topic coherence and document representation learning compared with existing state-of-the-art methods.
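To make the sampling idea concrete, the following is a minimal sketch of graph-based positive and negative sample generation followed by a contrastive loss. It is an illustration under stated assumptions, not the GCTM implementation: the function names (`row_normalize`, `word_graphs`, `sample_views`, `contrastive_loss`), the cosine-similarity thresholds, the single propagation hop, and the use of raw edge weights in place of encoder-produced topic representations are all hypothetical choices made for this example.

```python
import numpy as np


def row_normalize(mat):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    mat = np.asarray(mat, dtype=float)
    totals = mat.sum(axis=1, keepdims=True)
    return np.divide(mat, totals, out=np.zeros_like(mat), where=totals > 0)


def word_graphs(bow, pos_thresh=0.4, neg_thresh=0.1):
    """Build positive and negative word graphs from a bag-of-words matrix.

    Assumption: word relatedness is the cosine similarity between the
    document-occurrence columns of two words. Pairs at or above pos_thresh
    form the positive graph, pairs at or below neg_thresh the negative graph.
    """
    bow = np.asarray(bow, dtype=float)
    norms = np.linalg.norm(bow, axis=0, keepdims=True)
    unit = np.divide(bow, norms, out=np.zeros_like(bow), where=norms > 0)
    sim = unit.T @ unit  # (V, V) word-word cosine similarity
    pos_graph = (sim >= pos_thresh).astype(float)
    neg_graph = (sim <= neg_thresh).astype(float)
    np.fill_diagonal(neg_graph, 0.0)  # a word is never irrelevant to itself
    return pos_graph, neg_graph


def sample_views(doc_word, pos_graph, neg_graph):
    """Perturb document-word edges with one hop of propagation.

    The positive view moves edge weight to words correlated with the
    document's words; the negative view moves it to irrelevant words.
    """
    pos_sample = row_normalize(doc_word @ pos_graph)
    neg_sample = row_normalize(doc_word @ neg_graph)
    return pos_sample, neg_sample


def _row_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def contrastive_loss(proto, pos_sample, neg_sample, tau=0.5):
    """InfoNCE-style loss with one positive and one negative per prototype."""
    s_pos = _row_cosine(proto, pos_sample) / tau
    s_neg = _row_cosine(proto, neg_sample) / tau
    # -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))), averaged over documents
    return float(np.mean(np.logaddexp(s_pos, s_neg) - s_pos))


# Toy corpus: two documents about words 0-2, two about words 3-5.
toy_bow = np.array(
    [[2, 1, 0, 0, 0, 0],
     [1, 2, 1, 0, 0, 0],
     [0, 0, 0, 2, 1, 0],
     [0, 0, 0, 1, 2, 1]],
    dtype=float,
)
prototypes = row_normalize(toy_bow)  # document-word edge weights
pos_graph, neg_graph = word_graphs(toy_bow)
pos_views, neg_views = sample_views(prototypes, pos_graph, neg_graph)
loss = contrastive_loss(prototypes, pos_views, neg_views)
```

In this toy setting the negative view of each document places weight only on words that never co-occur with the document's own words, which is the property the sampling assumption asks for. A frequency-based sampler offers no such guarantee. In a full model, the three views would be passed through a graph encoder before the loss is computed.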