Massive data analysis calls for distributed algorithms and theory. We design a multi-round distributed algorithm for canonical correlation analysis. We construct principal directions through the convex formulation of canonical correlation analysis and use shift-and-invert preconditioning to accelerate convergence. The algorithm is communication-efficient. The resulting estimate achieves the same convergence rate as if all observations were pooled together, without imposing stringent restrictions on the number of machines. We adopt a gap-free analysis to bypass the widely used yet unrealistic assumption of an explicit gap between successive canonical correlations. Extensive simulations and applications to three benchmark image datasets demonstrate the empirical performance of the proposed algorithm and theory.
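To make the shift-and-invert idea concrete, the following is a minimal single-process sketch and not the algorithm proposed in this work. It assumes equal-sized data shards, averages their covariance blocks as a stand-in for one round of communication, and solves each shifted linear system exactly, whereas a communication-efficient method would presumably solve it approximately over several rounds. The function names, the shift value 1.05, and the iteration count are illustrative assumptions. The sketch writes the leading canonical pair as the top generalized eigenvector of the pencil (A, B), with A holding the cross-covariance blocks and B the within-set covariance blocks, and runs power iteration on (shift * B - A)^{-1} B.

```python
import numpy as np

def local_covariances(X, Y):
    """Covariance blocks from one shard (rows are observations)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    return Xc.T @ Xc / n, Xc.T @ Yc / n, Yc.T @ Yc / n

def top_canonical_pair(shards, shift=1.05, n_iter=50, seed=0):
    """Leading canonical pair via shift-and-invert power iteration.

    Assumptions (illustrative only): shards have equal size, so a plain
    average of local blocks is used; `shift` exceeds the top canonical
    correlation, which holds for any shift > 1.
    """
    blocks = [local_covariances(X, Y) for X, Y in shards]
    Sxx = np.mean([b[0] for b in blocks], axis=0)
    Sxy = np.mean([b[1] for b in blocks], axis=0)
    Syy = np.mean([b[2] for b in blocks], axis=0)
    p, q = Sxy.shape

    # Generalized eigenproblem A w = lambda B w; its top eigenvalue is the
    # leading canonical correlation.
    A = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
    B = np.block([[Sxx, np.zeros((p, q))], [np.zeros((q, p)), Syy]])
    M = shift * B - A  # positive definite when shift > 1 and B is

    rng = np.random.default_rng(seed)
    w = rng.standard_normal(p + q)
    for _ in range(n_iter):
        w = np.linalg.solve(M, B @ w)   # apply (shift*B - A)^{-1} B
        w /= np.sqrt(w @ B @ w)         # renormalize in the B-norm

    u, v = w[:p], w[p:]
    rho = (u @ Sxy @ v) / np.sqrt((u @ Sxx @ u) * (v @ Syy @ v))
    return u, v, rho

# Usage on synthetic data split across four simulated machines.
rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 1))  # shared latent signal
X = z @ np.ones((1, 5)) + 0.5 * rng.standard_normal((2000, 5))
Y = z @ np.ones((1, 4)) + 0.5 * rng.standard_normal((2000, 4))
shards = [(X[i::4], Y[i::4]) for i in range(4)]
u, v, rho = top_canonical_pair(shards)
print("estimated leading canonical correlation:", rho)
```

The closer the shift sits to the leading canonical correlation, the faster the iteration contracts, but the linear system also becomes more ill-conditioned. The sketch sidesteps that trade-off by using a fixed shift and an exact solve.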