Topic models are a popular tool for understanding text collections, but their evaluation has been a point of contention. Automated evaluation metrics such as coherence are often used; however, their validity has been questioned for neural topic models (NTMs), and they can overlook a model's benefits in real-world applications. To this end, we conduct the first evaluation of neural, supervised, and classical topic models in an interactive, task-based setting. We combine topic models with a classifier and test their ability to help humans conduct content analysis and document annotation. Across simulated, real-user, and expert pilot studies, the Contextual Neural Topic Model performs best on cluster evaluation metrics and human evaluations; however, LDA is competitive with two other NTMs in our simulated experiment and user study results, contrary to what coherence scores suggest. We show that current automated metrics do not provide a complete picture of topic modeling capabilities, but that the right choice of NTM can outperform classical models on practical tasks.