Trustworthy Dimensionality Reduction

From arXiv. This is a dissertation report presented by the author for the completion of the degree of Master of Statistics (M. Stat.) at the Indian Statistical Institute.

Different unsupervised dimensionality reduction models, such as PCA, LLE, Sammon's mapping, tSNE, and UMAP, work on different principles and are therefore difficult to compare on common ground. Although they are usually good for visualisation, they can produce spurious patterns that are not present in the original data, which undermines their trustability (or credibility). On the other hand, information about a response variable (or knowledge of class labels) permits supervised dimensionality reduction methods such as SIR and SAVE, which reduce the data dimension without hampering its ability to explain the particular response at hand. Because the reduction is tailored to that response, however, the reduced dataset cannot be used to analyse its relationship with other kinds of responses, i.e., it loses its generalizability. To design a dimensionality reduction algorithm with a better balance between these two properties, we formally describe the mathematical model used by dimensionality reduction algorithms and provide two indices that measure the intuitive concepts of trustability and generalizability. We then propose a Localized Skeletonization and Dimensionality Reduction (LSDR) algorithm that approximately achieves optimality in both indices. The proposed algorithm is compared with state-of-the-art algorithms such as tSNE and UMAP and is found to be better overall at preserving global structure while retaining useful local information. We also propose some possible extensions of LSDR that could make the algorithm applicable to various types of data, as tSNE and UMAP are.

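To make the idea of a low-dimensional embedding introducing spurious structure concrete, the sketch below computes a simple k-nearest-neighbour overlap between the original data and its reduced version. This is only an illustrative proxy and is not the trustability index defined in the dissertation; the function names `knn_indices` and `neighbour_preservation`, the toy data, and the choice k = 2 are all assumptions made for this example.

```python
import math


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def neighbour_preservation(high, low, k):
    """Average fraction of each point's k nearest neighbours in `high`
    that are also among its k nearest neighbours in `low`."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_high, nn_low)]
    return sum(overlaps) / len(overlaps)


# Toy data (hypothetical): five points lying on a line in 3D.
ts = [0, 1, 3, 7, 15]
high = [(t, 2 * t, 0.5 * t) for t in ts]

# A faithful 1D reduction keeps the ordering along the line.
low_faithful = [(t,) for t in ts]
# A distorted 1D reduction scrambles positions, creating false neighbours.
low_distorted = [(0,), (15,), (1,), (7,), (3,)]

print(neighbour_preservation(high, low_faithful, k=2))
print(neighbour_preservation(high, low_distorted, k=2))
```

A score near 1 suggests that local neighbourhoods survive the reduction, while a low score suggests that points appearing close in the embedding were not close in the original data. The index proposed in the dissertation may weigh such distortions differently.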