Widely used pipelines for analyzing high-dimensional data rely on two-dimensional visualizations, created, for instance, via t-distributed stochastic neighbor embedding (t-SNE). A crucial element of the t-SNE embedding procedure is the perplexity hyperparameter, because the embedding structure varies when perplexity is changed. A suitable perplexity depends on the data set and the intended use of the embedding, so it is often chosen based on heuristics, intuition, and prior experience. This paper uncovers a linear relationship between perplexity and data set size: we show that embeddings remain structurally consistent across samples of a data set when perplexity is adjusted linearly with the sample size. Qualitative and quantitative experimental results support this finding. The relationship informs the visualization process and guides the user in choosing a perplexity value. Finally, we outline several applications of this linear relationship to the visualization of high-dimensional data via t-SNE.
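As a rough illustration of how a linear perplexity rule might be applied when subsampling a data set, the sketch below rescales a reference perplexity to a smaller sample. The abstract does not give the slope or intercept of the relationship, so the zero-intercept (purely proportional) form used here is an assumption, and the function name `scaled_perplexity` and its parameters are hypothetical.

```python
def scaled_perplexity(reference_perplexity, reference_size, sample_size):
    """Rescale a perplexity value from a reference data set to a sample.

    Assumption (not stated in the abstract): the linear relationship passes
    through the origin, so perplexity is proportional to the number of points.
    """
    if reference_perplexity <= 0:
        raise ValueError("reference_perplexity must be positive")
    if reference_size <= 0 or sample_size <= 0:
        raise ValueError("data set sizes must be positive")
    # Keep the ratio perplexity / size constant across samples.
    return reference_perplexity * sample_size / reference_size


# Example: perplexity 50 was used on 10,000 points; rescale for a 2,000-point sample.
print(scaled_perplexity(50, 10_000, 2_000))  # 10.0
```

The resulting value could then be passed to a t-SNE implementation, for example as the `perplexity` argument of scikit-learn's `TSNE`. If the fitted relationship has a nonzero intercept, the proportional rule above would need to be replaced accordingly.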