These notes give an overview of some classical linear methods in Multivariate Data Analysis. This is a long-standing domain, well established since the 1960s and regularly renewed as a key step in statistical learning. It can be presented as part of statistical learning, or as dimensionality reduction with a geometric flavor. The two approaches are tightly linked: it is easier to learn patterns from data in low-dimensional spaces than in high-dimensional ones. The notes show how a diversity of methods and tools boil down to a single core method, PCA with SVD. As a result, efforts to optimize codes for analyzing massive data sets (such as distributed memory and task-based programming) or to improve the efficiency of the algorithms (such as Randomised SVD) can focus on this shared core method and benefit all the methods built on it.
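As a rough illustration of the core method named above, PCA can be computed from a thin SVD of the column-centered data matrix. The sketch below is a minimal NumPy version under stated assumptions, not the implementation used in the notes: the function name `pca_svd`, the sample-by-feature layout of `X`, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np


def pca_svd(X, k):
    """PCA of an (n_samples, n_features) array via a thin SVD (illustrative sketch)."""
    # Center each column so the SVD describes variation around the mean.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values in decreasing order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of the samples on the first k principal axes.
    scores = U[:, :k] * s[:k]
    # Principal axes, one per row.
    axes = Vt[:k]
    # Variance carried by each axis (unbiased estimate, n - 1 denominator).
    explained_variance = s[:k] ** 2 / (X.shape[0] - 1)
    return scores, axes, explained_variance


# Synthetic data, assumed only for the sake of the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
scores, axes, var = pca_svd(X, k=2)
```

For large matrices, the exact `np.linalg.svd` call is the step that a randomised or distributed SVD would replace, leaving the rest of the function unchanged.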