We address the challenge of incorporating document-level metadata into topic modeling to improve the estimation of topic mixtures. To overcome the computational cost and lack of theoretical guarantees of existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, to incorporate document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and similarities between them as edges, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of the proposed method by deriving high-probability bounds, and we develop a specialized cross-validation procedure to select its regularization parameters. We validate the model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.
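To make the idea of graph regularization concrete, the following is a minimal sketch and not the estimator proposed here. It assumes a quadratic graph-Laplacian penalty applied once, whereas the abstract describes an iterative graph-regularized SVD whose exact penalty is not specified. The function name `graph_smoothed_svd` and the parameters `edges`, `lam`, and `K` are hypothetical, chosen only for illustration. Rows of `X` are document word frequencies; the sketch smooths them along the edges of the document graph and then takes a rank-`K` truncated SVD.

```python
import numpy as np

def graph_smoothed_svd(X, edges, lam, K):
    """Illustrative only: Laplacian smoothing followed by a truncated SVD.

    X     : (n_docs, n_words) matrix of word frequencies, one row per document
    edges : iterable of (i, j) pairs marking similar documents (assumed input)
    lam   : non-negative regularization strength (assumed name)
    K     : number of singular components to keep (assumed name)
    """
    n = X.shape[0]

    # Unweighted graph Laplacian L = D - A of the document graph.
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0

    # Closed-form minimizer of ||X - S||_F^2 + lam * trace(S^T L S),
    # which pulls the rows of connected documents toward each other.
    S = np.linalg.solve(np.eye(n) + lam * L, X)

    # Rank-K truncated SVD of the smoothed matrix.
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K], S
```

With `lam = 0` the smoothing step returns `X` unchanged, so the sketch reduces to a plain truncated SVD; larger values of `lam` shrink the differences between rows of documents joined by an edge. In practice `lam` would be chosen by a data-driven procedure such as the cross-validation mentioned above.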