Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics when analysts are unfamiliar with a corpus, analysts also commonly start with some understanding of its content. This understanding may come from categories obtained in an initial pass over the corpus, or from a desire to analyze the corpus through a predefined set of categories derived from a high-level theoretical framework (e.g., political ideology). In these scenarios, analysts want a topic modeling approach that incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, an approach for label-name-supervised topic modeling. EdTM frames topic modeling as an assignment problem, leveraging LM/LLM-based document-topic affinities and using optimal transport to make globally coherent topic assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers and to topic models based on clustering and LDA. Further, we show EdTM's ability to incorporate various forms of analyst feedback while remaining robust to noisy analyst inputs.
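To make the assignment view concrete, the following is a minimal sketch of how document-topic affinities could be turned into globally coherent assignments with optimal transport. It is not EdTM's implementation: the abstract does not specify a solver, so this sketch assumes entropic regularization with Sinkhorn iterations, uniform mass per topic by default, and a small hand-written affinity matrix standing in for LM/LLM scores. The function name `sinkhorn_assign` and the parameters `topic_mass`, `epsilon`, and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_assign(affinity, topic_mass=None, epsilon=0.1, n_iters=200):
    """Soft-assign documents to topics with entropic optimal transport.

    affinity:   (n_docs, n_topics) array, higher means a better match.
    topic_mass: assumed target share of documents per topic (sums to 1);
                uniform if omitted.
    Returns a transport plan of shape (n_docs, n_topics).
    """
    affinity = np.asarray(affinity, dtype=float)
    n_docs, n_topics = affinity.shape
    row = np.full(n_docs, 1.0 / n_docs)  # every document carries equal mass
    if topic_mass is None:
        col = np.full(n_topics, 1.0 / n_topics)
    else:
        col = np.asarray(topic_mass, dtype=float)

    # Cost is the negative affinity; subtracting the max keeps exp() stable.
    kernel = np.exp((affinity - affinity.max()) / epsilon)

    u = np.ones(n_docs)
    v = np.ones(n_topics)
    for _ in range(n_iters):
        u = row / (kernel @ v)    # rescale to match document marginals
        v = col / (kernel.T @ u)  # rescale to match topic marginals
    return u[:, None] * kernel * v[None, :]


# Hypothetical affinities for 4 documents and 2 topic labels.
affinity = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.6],
    [0.2, 0.9],
])

greedy = affinity.argmax(axis=1)    # each document decided in isolation
plan = sinkhorn_assign(affinity)
balanced = plan.argmax(axis=1)      # decided jointly under topic marginals
```

On this toy input, the per-document argmax sends the first three documents to topic 0, whereas the transport plan moves the third document (whose two affinities are close) to topic 1 so that both topics receive equal mass. The uniform topic marginal is only a modeling assumption of this sketch; a different `topic_mass` would encode a different prior over topic sizes.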