Widely used pipelines for the analysis of high-dimensional data rely on two-dimensional visualizations, created for example via t-distributed stochastic neighbor embedding (t-SNE). On large data sets, these visualization techniques produce suboptimal embeddings because the hyperparameters are not suited to large data. Simply increasing these parameters usually does not work, as the computations become too expensive for practical workflows. In this paper, we argue that a sampling-based embedding approach can circumvent these problems. We show that hyperparameters must be chosen carefully, depending on the sampling rate and the intended final embedding. Further, we show how this approach speeds up the computation and increases the quality of the embeddings.
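To make the dependence of the hyperparameters on the sampling rate concrete, the following minimal sketch draws a uniform subsample and derives a perplexity for it. It rests on assumptions that the abstract does not state: that the perplexity intended for the full data set is scaled linearly by the sampling rate, and that the result is clamped to a range a typical t-SNE implementation accepts. The names `sample_indices`, `scaled_perplexity`, and `full_perplexity` are hypothetical and chosen only for illustration.

```python
import random


def sample_indices(n_points, rate, seed=0):
    """Draw a uniform random subsample of row indices without replacement."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    n_sample = max(1, round(n_points * rate))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_points), n_sample))


def scaled_perplexity(full_perplexity, rate, n_sample):
    """Perplexity for the sample, given the one intended for the full data.

    Assumption (not taken from the abstract): linear scaling by the
    sampling rate, so that a neighborhood covers roughly the same
    fraction of the data in the sample as in the full set.
    """
    # Assumed clamp: at least 2, and at most about a third of the sample size.
    upper = max(2.0, (n_sample - 1) / 3.0)
    return min(max(2.0, full_perplexity * rate), upper)


if __name__ == "__main__":
    idx = sample_indices(n_points=1_000_000, rate=0.01)
    perp = scaled_perplexity(full_perplexity=3000.0, rate=0.01, n_sample=len(idx))
    print(len(idx), perp)
```

The sampled rows and the scaled perplexity would then be passed to a t-SNE implementation of choice, for example the `perplexity` parameter of scikit-learn's `TSNE`. How the remaining points are placed into the final embedding is outside the scope of this sketch.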