Topic models are useful tools for discovering latent semantic structures in large textual corpora. Historically, topic modeling has relied on bag-of-words representations of language. This makes models sensitive to stop words and noise, and it discards potentially useful contextual information. Recent work has incorporated contextual neural representations into topic modeling, and the resulting models have been shown to outperform classical topic models. These approaches are, however, typically slow, volatile, and still require preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space and uncovers these axes with blind-source separation. Our approach yields the most diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. We offer an implementation of $S^3$, among other approaches, in the Turftopic Python package.
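The idea of treating topics as independent axes recovered by blind-source separation can be illustrated with a small sketch. The code below is not the Turftopic implementation. It assumes FastICA as the separation technique (the abstract does not name one), and it uses synthetic vectors as a stand-in for document embeddings, which in practice would come from a sentence encoder. All function and variable names are hypothetical.

```python
import numpy as np


def whiten(X, n_components):
    """Center X and project it onto its top principal axes, scaled to unit variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = np.argsort(eigvals)[::-1][:n_components]
    K = eigvecs[:, top] / np.sqrt(eigvals[top])  # shape (dim, n_components)
    return Xc @ K


def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W


def fast_ica(Z, n_iter=500, tol=1e-8, seed=0):
    """Symmetric FastICA with a tanh nonlinearity on whitened data Z."""
    n_samples, k = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(k, k)))
    for _ in range(n_iter):
        g = np.tanh(Z @ W.T)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, for every row w of W.
        W_new = (g.T @ Z) / n_samples - (1.0 - g**2).mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when each row is (up to sign) unchanged.
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W


# Synthetic stand-in for document embeddings (an assumption for illustration):
# three independent, non-Gaussian "topic" signals mixed linearly into 16 dimensions.
rng = np.random.default_rng(42)
n_docs, dim, n_topics = 2000, 16, 3
sources = rng.laplace(size=(n_docs, n_topics))
mixing = rng.normal(size=(n_topics, dim))
embeddings = sources @ mixing + 0.05 * rng.normal(size=(n_docs, dim))

Z = whiten(embeddings, n_topics)
W = fast_ica(Z)
doc_topic = Z @ W.T  # each column scores documents along one recovered axis

# Compare recovered axes with the true sources (order and sign are arbitrary in ICA).
corr = np.corrcoef(sources.T, doc_topic.T)[:n_topics, n_topics:]
best_match = np.abs(corr).max(axis=1)
print(best_match.round(3))
```

Each column of `doc_topic` plays the role of one topic axis. Blind-source separation recovers axes only up to permutation and sign, which is why the comparison uses the maximum absolute correlation per true source.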