Principal Component Analysis (PCA) is one of the most widely used tools for extracting low-dimensional representations of data, in particular of time series. Its performance is known to depend strongly on the quality (noise level) and the quantity of the data. Here we investigate the impact of heterogeneities, which are often present in real data, on the reconstruction of low-dimensional trajectories and of their associated modes. We focus in particular on the effects of sample-to-sample fluctuations and of component-dependent temporal convolution and noise in the measurements. Using the replica method, we derive analytical predictions for the error on the reconstructed trajectory and for the confusion between the modes in a high-dimensional setting in which the number and the dimension of the data are comparable. We find in particular that sample-to-sample variability is deleterious for the reconstruction of the signal trajectory but beneficial for the inference of the modes, and that fluctuations in the temporal convolution kernels prevent perfect recovery of the latent modes even for very weak measurement noise. Our predictions are corroborated by simulations with synthetic data over a range of control parameters.
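As a rough illustration of the baseline setting, the following minimal sketch generates synthetic data with a single latent mode and a smooth latent trajectory, adds homogeneous Gaussian measurement noise, and reconstructs both with PCA. It deliberately omits the heterogeneities studied here (sample-to-sample fluctuations, component-dependent convolution and noise). The function name, the parameters `n_dims`, `n_times`, `signal_strength`, and the sinusoidal trajectory are all illustrative assumptions, not the model or the simulation protocol of the paper.

```python
import numpy as np


def simulate_and_reconstruct(n_dims=200, n_times=400, signal_strength=3.0, seed=0):
    """Toy rank-one PCA experiment (hypothetical setup, homogeneous noise only).

    Returns the absolute overlap between the true and estimated mode, and the
    absolute correlation between the true and reconstructed trajectory.
    """
    rng = np.random.default_rng(seed)

    # True mode: a random unit vector in the n_dims-dimensional measurement space.
    mode = rng.standard_normal(n_dims)
    mode /= np.linalg.norm(mode)

    # Latent trajectory: a smooth signal with approximately unit variance (assumed form).
    t = np.linspace(0.0, 4.0 * np.pi, n_times)
    trajectory = np.sqrt(2.0) * np.sin(t)

    # Data matrix (n_dims x n_times): rank-one signal plus i.i.d. Gaussian noise.
    data = signal_strength * np.outer(mode, trajectory)
    data += rng.standard_normal((n_dims, n_times))

    # Center each component over time, then perform PCA through the SVD.
    data -= data.mean(axis=1, keepdims=True)
    left_vectors, _, _ = np.linalg.svd(data, full_matrices=False)

    # Leading principal direction and the trajectory obtained by projecting onto it.
    estimated_mode = left_vectors[:, 0]
    estimated_trajectory = estimated_mode @ data

    # Absolute values remove the arbitrary sign of the singular vector.
    mode_overlap = abs(float(estimated_mode @ mode))
    trajectory_corr = abs(float(np.corrcoef(estimated_trajectory, trajectory)[0, 1]))
    return mode_overlap, trajectory_corr


if __name__ == "__main__":
    for strength in (0.3, 1.0, 3.0):
        overlap, corr = simulate_and_reconstruct(signal_strength=strength)
        print(f"signal_strength={strength}: mode overlap={overlap:.3f}, "
              f"trajectory correlation={corr:.3f}")
```

Here the number of components and the number of time points are of the same order, which mimics the high-dimensional regime in which both reconstruction errors depend on the signal-to-noise ratio. Comparing the printed values across `signal_strength` gives a qualitative feel for that dependence, without reproducing any analytical prediction.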