Simultaneous Dimensionality Reduction: A Data Efficient Approach for Multimodal Representations Learning

We explore two primary classes of approaches to dimensionality reduction (DR): Independent Dimensionality Reduction (IDR) and Simultaneous Dimensionality Reduction (SDR). In IDR methods, of which Principal Component Analysis (PCA) is a paradigmatic example, each modality is compressed independently, striving to retain as much variation within each modality as possible. In contrast, in SDR, one simultaneously compresses the modalities to maximize the covariation between the reduced descriptions while paying less attention to how much individual variation is preserved. Paradigmatic examples include Partial Least Squares (PLS) and Canonical Correlation Analysis (CCA). Even though these DR methods are a staple of statistics, their relative accuracy and dataset size requirements are poorly understood. We introduce a generative linear model to synthesize multimodal data with known variance and covariance structures to examine these questions. We assess the accuracy of the reconstruction of the covariance structures as a function of the number of samples, signal-to-noise ratio, and the number of varying and covarying signals in the data. Using numerical experiments, we demonstrate that linear SDR methods consistently outperform linear IDR methods and yield higher-quality, more succinct reduced-dimensional representations with smaller datasets. Remarkably, regularized CCA can identify low-dimensional weak covarying structures even when the number of samples is much smaller than the dimensionality of the data, which is a regime challenging for all dimensionality reduction methods. Our work corroborates and explains previous observations in the literature that SDR can be more effective in detecting covariation patterns in data. These findings suggest that SDR should be preferred to IDR in real-world data analysis when detecting covariation is more important than preserving variation.
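To make the contrast between independent and simultaneous reduction concrete, the following is a minimal sketch rather than the generative model or the experiments used in this work. It assumes a toy two-modality linear model in which one weak latent signal is shared between the modalities and several stronger latent signals are private to each. All function names, dimensions, and scale parameters are hypothetical choices made for illustration. The IDR stand-in is the leading principal component of one modality; the SDR stand-in is the simplest form of Partial Least Squares, namely the leading singular vectors of the sample cross-covariance matrix. Regularized Canonical Correlation Analysis is not shown.

```python
import numpy as np


def make_two_view_data(n_samples=2000, dim=20, n_private=2,
                       shared_scale=1.0, private_scale=2.0,
                       noise_scale=0.5, seed=0):
    """Toy linear model (an assumption for illustration): each view mixes one
    shared latent signal, a few stronger private signals, and isotropic noise."""
    rng = np.random.default_rng(seed)

    def orthonormal_directions():
        # Orthonormal columns: first is the shared direction, rest are private.
        q, _ = np.linalg.qr(rng.standard_normal((dim, 1 + n_private)))
        return q[:, :1], q[:, 1:]

    shared_x, private_x = orthonormal_directions()
    shared_y, private_y = orthonormal_directions()
    shared_latent = rng.standard_normal((n_samples, 1))  # common to both views

    def make_view(shared_dir, private_dirs):
        private_latents = rng.standard_normal((n_samples, n_private))
        noise = rng.standard_normal((n_samples, dim))
        return (shared_scale * shared_latent @ shared_dir.T
                + private_scale * private_latents @ private_dirs.T
                + noise_scale * noise)

    X = make_view(shared_x, private_x)
    Y = make_view(shared_y, private_y)
    return X, Y, shared_x[:, 0], shared_y[:, 0]


def top_pca_direction(X):
    """IDR stand-in: leading principal direction of a single view."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def top_pls_directions(X, Y):
    """SDR stand-in: leading singular vectors of the sample cross-covariance."""
    xc = X - X.mean(axis=0)
    yc = Y - Y.mean(axis=0)
    cross_cov = xc.T @ yc / (len(X) - 1)
    u, _, vt = np.linalg.svd(cross_cov)
    return u[:, 0], vt[0]


def overlap(a, b):
    """Absolute cosine between two directions (1 means perfectly aligned)."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


X, Y, true_x, true_y = make_two_view_data()
pca_overlap = overlap(top_pca_direction(X), true_x)
pls_x, pls_y = top_pls_directions(X, Y)
pls_overlap = overlap(pls_x, true_x)
print(f"PCA overlap with shared direction: {pca_overlap:.3f}")
print(f"PLS overlap with shared direction: {pls_overlap:.3f}")
```

Under these assumed settings the private signals carry more variance than the shared one, so the leading principal component is expected to align with a private direction, while the cross-covariance approach is expected to align with the shared direction. Changing `n_samples`, `noise_scale`, or `private_scale` gives a rough feel for how sample size and signal-to-noise ratio affect recovery.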
