Topic models are useful tools for discovering latent semantic structures in large textual corpora. Historically, topic modeling has relied on bag-of-words representations of language. This makes models sensitive to stop words and noise, and leaves potentially useful contextual information unused. Recent efforts have aimed at incorporating contextual neural representations into topic modeling, and the resulting models have been shown to outperform classical topic models. These approaches are, however, typically slow and volatile, and they still require preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space and uncovers these with blind-source separation. Our approach provides the most diverse, highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. We offer an implementation of $S^3$, among other approaches, in the Turftopic Python package.
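The idea of treating topics as independent axes recovered by blind-source separation can be illustrated with a minimal sketch. It is not the Turftopic API and not necessarily the authors' exact procedure. It assumes FastICA from scikit-learn as the separation algorithm (the abstract does not name one), uses synthetic arrays in place of real sentence and term embeddings, and scores terms by projecting their embeddings onto the recovered axes, which is also an assumption. All variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_docs, n_terms, dim, n_topics = 200, 50, 16, 3

# Synthetic stand-ins for contextual embeddings (assumption: a real pipeline
# would obtain these from a sentence encoder instead).
sources = rng.laplace(size=(n_docs, n_topics))   # non-Gaussian latent "topics"
mixing = rng.normal(size=(n_topics, dim))        # unknown linear mixing
doc_embeddings = sources @ mixing + 0.01 * rng.normal(size=(n_docs, dim))
term_embeddings = rng.normal(size=(n_terms, dim))
vocab = [f"term_{i}" for i in range(n_terms)]    # placeholder vocabulary

# Blind-source separation: estimate statistically independent axes
# from the document embeddings.
ica = FastICA(n_components=n_topics, random_state=0, max_iter=1000)
doc_topic = ica.fit_transform(doc_embeddings)    # shape (n_docs, n_topics)

# Assumed term scoring: project term embeddings onto the same axes.
term_topic = ica.transform(term_embeddings)      # shape (n_terms, n_topics)

# Highest-scoring terms along each axis serve as topic descriptions.
top_terms = {
    k: [vocab[i] for i in np.argsort(-term_topic[:, k])[:5]]
    for k in range(n_topics)
}
```

Because the inputs are dense embeddings rather than word counts, a sketch like this involves no stop-word filtering or other text preprocessing. Note that the sign of each recovered axis is arbitrary in independent component analysis, so terms at both ends of an axis may be informative.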