The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction applied to the document-term matrix or to a representation within a latent embedding, including topic models. The resulting layout therefore depends on the input data and the hyperparameters of the dimensionality reduction and changes when either of them changes. However, such changes to the layout require additional cognitive effort from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts with respect to (1) changes in the text corpora, (2) changes in the hyperparameters, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combinations of three text corpora and six text embeddings, using a grid-search-inspired hyperparameter selection for the dimensionality reductions, and then quantified the similarity of the layouts through ten metrics covering local and global structures and class separation. Second, we analyzed the resulting 42817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlighted specific hyperparameter settings. We provide our implementation as a Git repository at https://github.com/hpicgs/Topic-Models-and-Dimensionality-Reduction-Sensitivity-Study and the results as a Zenodo archive at https://doi.org/10.5281/zenodo.12772898.
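To make the idea of quantifying the similarity of two layouts concrete, the following is a minimal sketch of one possible local-structure measure: the mean Jaccard overlap of each document's k nearest neighbors in two 2D layouts of the same documents. It is an illustration only and is not claimed to be one of the ten metrics used in the study. The function names `knn_sets` and `local_overlap`, the default `k=5`, and the toy layouts are assumptions introduced here.

```python
import math
import random


def knn_sets(points, k):
    """For each 2D point, return the set of indices of its k nearest neighbors."""
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance; ties are broken by index.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def local_overlap(layout_a, layout_b, k=5):
    """Mean Jaccard overlap of k-NN sets between two layouts (hypothetical metric).

    Both layouts must list the same documents in the same order.
    Returns a value in [0, 1]; 1 means identical local neighborhoods.
    """
    if len(layout_a) != len(layout_b):
        raise ValueError("layouts must contain the same number of points")
    nn_a = knn_sets(layout_a, k)
    nn_b = knn_sets(layout_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(nn_a, nn_b)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for a layout produced by a dimensionality reduction.
    layout = [(rng.random(), rng.random()) for _ in range(50)]
    # A mirrored layout looks different but keeps all pairwise distances.
    mirrored = [(-x, y) for x, y in layout]
    # A shuffled layout assigns documents to unrelated positions.
    shuffled = layout[:]
    rng.shuffle(shuffled)
    print("mirrored:", local_overlap(layout, mirrored))
    print("shuffled:", local_overlap(layout, shuffled))
```

A neighborhood-based measure like this one ignores rotations and reflections of the layout, which matters when comparing runs that differ only in random initialization. Global structure and class separation would need separate measures.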