In this paper, we explore the role of matrix scaling on a matrix of counts when building a topic model using non-negative matrix factorization (NMF). We present a scaling inspired by the normalized Laplacian (NL) for graphs that can greatly improve the quality of a non-negative matrix factorization. The results parallel those in the spectral graph clustering work of \cite{Priebe:2019}, where the authors proved that spectral clustering with adjacency spectral embedding (ASE) was more likely to discover core-periphery partitions and that Laplacian spectral embedding (LSE) was more likely to discover affinity partitions. In text analysis, NMF is typically applied to a matrix of co-occurrence counts of ``contexts'' and ``terms''. The matrix scaling inspired by LSE gives significant improvement for text topic models on a variety of datasets. We illustrate how matrix scaling in NMF can dramatically improve the quality of a topic model on three datasets where human annotation is available. Using the adjusted Rand index (ARI), a measure of cluster similarity, we see an increase of 50\% for Twitter data and over 200\% for a newsgroup dataset versus using counts, which is the analogue of ASE. For clean data, such as those from the Document Understanding Conference, NL gives over 40\% improvement over ASE. We conclude with some analysis of this phenomenon and some connections of this scaling with other matrix scaling methods.
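The abstract does not state the scaling formula, so the following is only a minimal sketch under an assumption: that the NL-inspired scaling takes the usual normalized-Laplacian form for a rectangular count matrix $A$, namely $D_r^{-1/2} A D_c^{-1/2}$, where $D_r$ and $D_c$ are diagonal matrices of row sums and column sums. The function name `nl_scale` and the guard parameter `eps` are hypothetical and introduced here for illustration.

```python
import numpy as np


def nl_scale(counts, eps=1e-12):
    """Return D_r^{-1/2} A D_c^{-1/2} for a non-negative count matrix A.

    Assumption: this is the normalized-Laplacian-style scaling; the exact
    form used in the paper is not given in the abstract.
    """
    A = np.asarray(counts, dtype=float)
    row_sums = A.sum(axis=1)  # one total per context (row)
    col_sums = A.sum(axis=0)  # one total per term (column)
    # Clamp the sums so that empty rows or columns do not divide by zero;
    # their entries are all zero, so they stay zero after scaling.
    r = 1.0 / np.sqrt(np.maximum(row_sums, eps))
    c = 1.0 / np.sqrt(np.maximum(col_sums, eps))
    # Broadcasting applies the row factor and the column factor entrywise.
    return A * r[:, None] * c[None, :]


# Toy context-by-term count matrix (made-up numbers, not from the paper).
counts = np.array([[4, 0],
                   [1, 4]])
scaled = nl_scale(counts)
```

The scaled matrix remains non-negative, so it can be passed to any NMF routine in place of the raw counts; factoring the raw counts directly corresponds to the ASE-like baseline described above.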