Hierarchical topic modeling aims to discover latent topics from a corpus and organize them into a hierarchy, so that documents can be understood at the desired semantic granularity. However, existing methods struggle to produce topic hierarchies with high affinity, rationality, and diversity, which hampers document understanding. To overcome these challenges, we propose the Transport Plan and Context-aware Hierarchical Topic Model (TraCo). In place of the simple topic dependencies used in earlier work, we propose a transport plan dependency method. It constrains dependencies to be sparse and balanced, and uses them to regularize the construction of the topic hierarchy. This improves the affinity and diversity of hierarchies. We further propose a context-aware disentangled decoder. Unlike the entangled decoding of previous work, it assigns different semantic granularity to topics at different levels through disentangled decoding. This improves the rationality of hierarchies. Experiments on benchmark datasets demonstrate that our method surpasses state-of-the-art baselines, effectively improving the affinity, rationality, and diversity of hierarchical topic modeling and achieving better performance on downstream tasks.
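The abstract does not spell out how the transport plan dependencies are computed, so the following is only a rough illustration and not the authors' formulation. It assumes that dependencies between child and parent topics are modeled as an entropic-regularized optimal transport plan, solved with Sinkhorn iterations under uniform marginals. The function names, the toy two-dimensional topic embeddings, the squared Euclidean cost, and the `epsilon` value are all hypothetical choices made for this sketch.

```python
import math


def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropic-regularized transport plan with uniform marginals.

    cost[i][j] is the (assumed) cost of linking child topic i to parent topic j.
    Returns a matrix whose rows sum to 1/n and whose columns sum to 1/m.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on child topics (assumption)
    b = [1.0 / m] * m  # uniform mass on parent topics (assumption)
    # Gibbs kernel: low-cost links receive high weight.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def squared_distance(x, y):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy embeddings: four child topics and two parent topics.
children = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
parents = [[0.0, 0.0], [1.0, 1.0]]

cost = [[squared_distance(c, p) for p in parents] for c in children]
plan = sinkhorn_plan(cost)

for i, row in enumerate(plan):
    print(f"child {i}: " + "  ".join(f"{p:.4f}" for p in row))
```

In this sketch the column marginals play the role of balance, since each parent topic receives the same total mass and no parent can absorb all children. A small `epsilon` concentrates each child's mass on its nearest parent, which gives an approximately sparse plan. How TraCo defines the cost and uses the plan as a regularizer is not stated in the abstract and is left out here.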