Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.
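The abstract reports topic coherence without naming the measure used, so the following is only an illustrative sketch of one widely used option, normalized pointwise mutual information (NPMI) computed from document-level co-occurrence. The choice of NPMI, the function name `npmi_coherence`, and the toy responses in the usage note are assumptions made for illustration, not the authors' procedure or data.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    Assumption: coherence is measured as document-level NPMI; the
    abstract does not state which coherence measure was used.
    `documents` is a list of tokenized responses (lists of strings).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words to form a pair")
    if not documents:
        raise ValueError("need at least one document")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: NPMI lower bound
        elif p12 == 1:
            scores.append(0.0)   # both words in every document: PMI is 0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)
```

As a usage example on invented toy data, `npmi_coherence(["work", "stress"], [["work", "stress"], ["work", "stress", "deadline"], ["family", "time"], ["family", "weekend"]])` scores a pair that always appears together, while swapping in `["work", "family"]` scores a pair that never does. With very short survey responses, co-occurrence counts are sparse, which is one reason coherence estimates on such data can be unstable.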