In recent years, fully automated content analysis based on probabilistic topic models has become popular among social scientists because of its scalability. The unsupervised nature of these models makes them suitable for exploring topics in a corpus without prior knowledge. However, researchers find that these models often fail to measure specific concepts of substantive interest because they inadvertently create multiple topics with similar content and combine distinct themes into a single topic. In this paper, we empirically demonstrate that providing a small number of keywords can substantially enhance the measurement performance of topic models. An important advantage of the proposed keyword-assisted topic model (keyATM) is that the specification of keywords requires researchers to label topics prior to fitting a model to the data. This contrasts with the widespread practice of post-hoc topic interpretation and adjustment, which compromises the objectivity of empirical findings. In our application, we find that keyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than standard topic models. Finally, we show that keyATM can also incorporate covariates and model time trends. An open-source software package is available for implementing the proposed methodology.