Topic models are a popular tool for understanding text collections, but their evaluation has been a point of contention. Automated evaluation metrics such as coherence are often used; however, their validity has been questioned for neural topic models (NTMs), and they can overlook the benefits of a model in real-world applications. To this end, we conduct the first evaluation of neural, supervised, and classical topic models in an interactive, task-based setting. We combine topic models with a classifier and test their ability to help humans conduct content analysis and document annotation. Across simulated, real-user, and expert pilot studies, the Contextual Neural Topic Model performs best on cluster evaluation metrics and human evaluations; however, LDA is competitive with two other NTMs in our simulated experiment and user study results, contrary to what coherence scores suggest. We show that current automated metrics do not provide a complete picture of topic modeling capabilities, but that the right choice of NTM can outperform classical models on practical tasks.