This study introduces Bidirectional Topic Matching (BTM), a novel method for cross-corpus topic modeling that quantifies thematic overlap and divergence between corpora. BTM is a flexible framework that can incorporate various topic modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet Allocation (LDA). It takes a dual-model approach: a separate topic model is trained on each corpus, and each model is then applied to the other corpus, so that topics can be compared in both directions. This reciprocal application identifies both the themes the corpora share and the topics unique to each. Validation against cosine similarity-based methods shows strong agreement between the two approaches and indicates that BTM has distinct advantages in handling outlier topics. A case study on climate news articles illustrates BTM's utility, revealing significant thematic overlaps and distinctions between a corpus focused on climate change and one focused on climate action. Because it combines shared-topic and unique-topic analysis in a single framework, BTM is applicable to tasks ranging from political discourse analysis to interdisciplinary studies, with potential extensions to multilingual and dynamic datasets.
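The dual-model idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `KeywordTopicModel` is a hypothetical toy stand-in for a trained topic model (it is not BERTopic, Top2Vec, or LDA), and the matching rule used here (two topics count as shared when each model assigns most of the other topic's documents to its counterpart) is one plausible reading of "applying them reciprocally", not necessarily the criterion the authors use.

```python
from collections import Counter, defaultdict

OUTLIER = -1  # label for documents that fit no topic


class KeywordTopicModel:
    """Hypothetical toy stand-in for a trained topic model."""

    def __init__(self, topics):
        # topics: dict mapping topic id -> set of keywords
        self.topics = topics

    def transform(self, docs):
        """Assign each document to the topic with the most keyword hits."""
        labels = []
        for doc in docs:
            words = set(doc.lower().split())
            hits = {t: len(words & kw) for t, kw in self.topics.items()}
            best = max(hits, key=hits.get)
            labels.append(best if hits[best] > 0 else OUTLIER)
        return labels


def dominant_targets(own_model, other_model, docs):
    """Map each own_model topic to the other_model label its documents mostly receive."""
    votes = defaultdict(Counter)
    for own, other in zip(own_model.transform(docs), other_model.transform(docs)):
        if own != OUTLIER:
            votes[own][other] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in votes.items()}


def bidirectional_match(model_a, docs_a, model_b, docs_b):
    """Assumed rule: a topic pair is shared only if the match holds in both directions."""
    a_to_b = dominant_targets(model_a, model_b, docs_a)
    b_to_a = dominant_targets(model_b, model_a, docs_b)
    shared = sorted(
        (a, b) for a, b in a_to_b.items() if b != OUTLIER and b_to_a.get(b) == a
    )
    unique_a = sorted(set(model_a.topics) - {a for a, _ in shared})
    unique_b = sorted(set(model_b.topics) - {b for _, b in shared})
    return shared, unique_a, unique_b


# Made-up miniature corpora, for illustration only.
model_a = KeywordTopicModel({0: {"warming", "temperature", "emissions"},
                             1: {"glacier", "ice", "arctic"}})
model_b = KeywordTopicModel({0: {"emissions", "carbon", "reduction"},
                             1: {"protest", "strike", "activists"}})
docs_a = ["global warming raises temperature as carbon emissions grow",
          "emissions drive warming and carbon levels",
          "arctic ice and glacier retreat"]
docs_b = ["carbon reduction targets cut emissions",
          "emissions reduction slows warming",
          "activists join climate strike protest"]

shared, unique_a, unique_b = bidirectional_match(model_a, docs_a, model_b, docs_b)
print(shared, unique_a, unique_b)  # [(0, 0)] [1] [1]
```

In this toy run the emissions topics of the two models match in both directions, while the glacier topic and the protest topic land in the other model's outlier class and are therefore reported as unique to their corpus. With a real topic model, the `transform` step would be replaced by that model's own inference on unseen documents.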