A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperforms a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe on the neural language model and find that the baseline's hidden states already encode topic information. We make public all code used for this study.