Dimension reduction (DR) algorithms have proven extremely useful for gaining insight into large-scale high-dimensional datasets, particularly for finding clusters in transcriptomic data. The initial phase of these DR methods often involves converting the original high-dimensional data into a graph in which each edge represents the similarity or dissimilarity between a pair of data points. However, this graph is frequently suboptimal because high-dimensional distances are unreliable and only limited information is extracted from the high-dimensional data. The problem is exacerbated as the dataset size increases. If we reduce the size of the dataset by selecting points from specific sections of the embedding, the clusters observed through DR are more separable, since the extracted subgraphs are more reliable. In this paper, we introduce LocalMAP, a new DR algorithm that dynamically and locally adjusts the graph to address this challenge. By dynamically extracting subgraphs and updating the graph on the fly, LocalMAP is capable of identifying and separating real clusters within the data that other DR methods may overlook or combine. We demonstrate the benefits of LocalMAP through a case study on biological datasets, highlighting its utility in helping users more accurately identify clusters for real-world problems.
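To make the graph-construction step concrete, the following is a minimal sketch of a k-nearest-neighbor graph built from Euclidean distances, followed by the same construction restricted to a subset of points. It is an illustration only and is not LocalMAP's procedure: the function names `knn_graph` and `knn_subgraph`, the choice of Euclidean distance, the value of `k`, and the synthetic two-cluster data are all assumptions introduced here.

```python
import numpy as np

def knn_graph(X, k):
    """Return an (n, k) array holding the indices of each point's
    k nearest neighbors under Euclidean distance (self excluded)."""
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||x||^2 + ||y||^2 - 2 x.y
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def knn_subgraph(X, subset, k):
    """Hypothetical helper: rebuild the kNN graph using only the points
    in `subset`, returning neighbor indices in the original numbering."""
    subset = np.asarray(subset)
    local = knn_graph(X[subset], k)
    return subset[local]

# Synthetic data (an assumption for illustration): two well-separated
# clusters of 20 points each in 50 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 50)),
    rng.normal(5.0, 0.1, size=(20, 50)),
])
labels = np.array([0] * 20 + [1] * 20)

nbrs = knn_graph(X, k=5)                        # graph on the full dataset
sub_nbrs = knn_subgraph(X, np.arange(20), k=5)  # graph on the first cluster only
```

Each row of `nbrs` lists the outgoing edges of one point. On real high-dimensional data the full-dataset graph may contain unreliable edges, which is the situation the abstract describes; recomputing neighbors on a smaller selection, as in `knn_subgraph`, is one simple way to picture what extracting a subgraph could mean.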