Topic models are a popular tool for clustering and analyzing textual data. They allow texts to be classified by their affiliation with previously computed topics. Despite their widespread use in research and applications, the in-depth analysis of topic models remains an open research problem. State-of-the-art methods for interpreting topic models rely on simple visualizations, such as similarity matrices, top-term lists, or embeddings, which are limited to at most three dimensions. In this paper, we propose an incidence-geometric method for deriving an ordinal structure from flat topic models, such as those obtained by non-negative matrix factorization. This structure enables the analysis of the topic model in a higher (order) dimension and the extraction of conceptual relationships between several topics at once. Because it relies on conceptual scaling, our approach does not introduce artificial topical relationships, such as artifacts of feature compression. Based on our findings, we present a new visualization paradigm for concept hierarchies based on ordinal motifs, which allows for a top-down view of topic spaces. We introduce our approach and demonstrate its applicability using a topic model derived from a corpus of scientific papers from 32 top machine learning venues.