Simultaneous Dimensionality Reduction: A Data Efficient Approach for Multimodal Representations Learning

We explore two primary classes of approaches to dimensionality reduction (DR): Independent Dimensionality Reduction (IDR) and Simultaneous Dimensionality Reduction (SDR). In IDR methods, of which Principal Components Analysis (PCA) is a paradigmatic example, each modality is compressed independently, striving to retain as much variation within each modality as possible. In contrast, in SDR, one simultaneously compresses the modalities to maximize the covariation between the reduced descriptions while paying less attention to how much individual variation is preserved. Paradigmatic examples include Partial Least Squares (PLS) and Canonical Correlations Analysis (CCA). Even though these DR methods are a staple of statistics, their relative accuracy and dataset size requirements are poorly understood. To examine these questions, we introduce a generative linear model that synthesizes multimodal data with known variance and covariance structures. We assess the accuracy of the reconstruction of the covariance structures as a function of the number of samples, the signal-to-noise ratio, and the number of varying and covarying signals in the data. Using numerical experiments, we demonstrate that linear SDR methods consistently outperform linear IDR methods and yield higher-quality, more succinct reduced-dimensional representations with smaller datasets. Remarkably, regularized CCA can identify low-dimensional weak covarying structures even when the number of samples is much smaller than the dimensionality of the data, a regime that is challenging for all dimensionality reduction methods. Our work corroborates and explains previous observations in the literature that SDR can be more effective in detecting covariation patterns in data. These findings suggest that SDR should be preferred to IDR in real-world data analysis when detecting covariation is more important than preserving variation.
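The contrast between IDR and SDR can be illustrated with a small numerical sketch. The code below is not the generative model used in this work; it is a minimal toy version built on stated assumptions: two modalities of equal dimension, one shared latent signal, one stronger private latent signal per modality, and isotropic Gaussian noise. The function names (`simulate`, `pca_first_direction`, `pls_first_direction`, `overlap`) and all parameter values are hypothetical choices made for illustration. PCA stands in for IDR, and the leading singular vectors of the sample cross-covariance matrix (a simple PLS variant) stand in for SDR.

```python
import numpy as np


def simulate(n_samples=2000, dim=50, shared_scale=2.0, private_scale=3.0,
             noise_scale=1.0, seed=0):
    """Toy two-modality linear model (an illustrative assumption, not the paper's)."""
    rng = np.random.default_rng(seed)
    # Orthonormal loading vectors: column 0 carries the shared signal,
    # column 1 carries the private (modality-specific) signal.
    qx, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    qy, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    z = rng.standard_normal(n_samples)    # latent shared by both modalities
    px = rng.standard_normal(n_samples)   # latent private to X
    py = rng.standard_normal(n_samples)   # latent private to Y
    X = (shared_scale * np.outer(z, qx[:, 0])
         + private_scale * np.outer(px, qx[:, 1])
         + noise_scale * rng.standard_normal((n_samples, dim)))
    Y = (shared_scale * np.outer(z, qy[:, 0])
         + private_scale * np.outer(py, qy[:, 1])
         + noise_scale * rng.standard_normal((n_samples, dim)))
    return X, Y, qx[:, 0], qy[:, 0]


def pca_first_direction(X):
    """IDR stand-in: leading principal direction of a single modality."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]


def pls_first_direction(X, Y):
    """SDR stand-in: leading singular vectors of the sample cross-covariance."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross_cov = Xc.T @ Yc / X.shape[0]
    u, _, vt = np.linalg.svd(cross_cov)
    return u[:, 0], vt[0]


def overlap(a, b):
    """Absolute cosine similarity between two directions (sign-invariant)."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


if __name__ == "__main__":
    X, Y, true_x, true_y = simulate()
    pls_x, pls_y = pls_first_direction(X, Y)
    print("PCA overlap with shared direction in X:",
          overlap(pca_first_direction(X), true_x))
    print("PLS overlap with shared direction in X:", overlap(pls_x, true_x))
    print("PLS overlap with shared direction in Y:", overlap(pls_y, true_y))
```

In this toy setting the private signal has larger variance than the shared one, so the leading principal component of each modality is expected to align with the private direction, while the cross-covariance approach is expected to align with the shared direction. Lowering `n_samples` or `shared_scale` gives a rough feel for how each estimate degrades with less data or weaker covariation.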
