t-Distributed Stochastic Neighbor Embedding (t-SNE) has emerged as a popular dimensionality reduction technique for visualizing high-dimensional data. By default, it computes pairwise similarities between data points using an RBF (Gaussian) kernel and initializes the low-dimensional embedding randomly, which successfully captures the overall structure but may struggle to preserve the local structure efficiently. This research proposes the Modified Isolation Kernel (MIK), a novel kernel built upon the concept of the Isolation Kernel, as an alternative to the Gaussian kernel. MIK uses adaptive density estimation to capture local structures more accurately and integrates robustness measures. It also assigns higher similarity values to nearby points and lower values to distant points. We assess the proposed approach (MIK) through a comparative study against the standard Gaussian kernel and the Isolation Kernel, using several initialization techniques, including random, PCA, and random walk initializations. Additionally, we compare the computational efficiency of all three kernels with three different initialization methods. Our experimental results demonstrate several advantages of the proposed kernel (MIK) and of initialization method selection: improved preservation of the local and global structure and better visualization of clusters and subclusters in the embedded space. These findings contribute to advancing dimensionality reduction techniques and provide researchers and practitioners with an effective tool for data exploration, visualization, and analysis in various domains.
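To make the two baseline kernels concrete, the following minimal sketch computes Gaussian affinities and a plain Isolation Kernel estimate on a toy dataset. It rests on several assumptions that are not taken from the paper: the Gaussian version uses one shared bandwidth `sigma` rather than the per-point, perplexity-tuned bandwidths that t-SNE normally uses, and the Isolation Kernel version is the common nearest-centre partitioning variant, not MIK itself. The function names and the parameters `sigma`, `psi`, `trees`, and `seed` are illustrative only.

```python
import math
import random


def gaussian_affinities(points, sigma=1.0):
    """Row-normalised Gaussian (RBF) affinities with one shared bandwidth (a simplification)."""
    n = len(points)
    rows = []
    for i in range(n):
        # Unnormalised Gaussian weight for every other point; self-affinity is set to 0.
        weights = [
            0.0 if i == j
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def isolation_similarity(points, psi=4, trees=200, seed=0):
    """Fraction of random partitionings in which two points fall into the same cell."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(trees):
        # Each partitioning is defined by psi centres sampled from the data.
        centres = rng.sample(points, psi)
        # Assign every point to the cell of its nearest centre.
        cell = [min(range(psi), key=lambda c: math.dist(p, centres[c])) for p in points]
        for i in range(n):
            for j in range(n):
                if cell[i] == cell[j]:
                    counts[i][j] += 1
    return [[c / trees for c in row] for row in counts]


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
P = gaussian_affinities(points)
K = isolation_similarity(points)
```

In this sketch both kernels give a higher similarity to a pair from the same group than to a pair from different groups, which is the behaviour the abstract describes for nearby versus distant points. The Isolation Kernel estimate depends on the data distribution through the sampled centres rather than on a fixed distance scale.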