Quantifying the dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute/estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several tasks, encompassing independence testing, multi-view representation learning, algorithmic fairness, and generative modeling. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.
翻译:量化高维随机变量之间的依赖关系是统计学习与推断的核心问题。两种经典方法是典型相关分析(CCA)和香农互信息:前者识别原始变量投影后相关性最强的成分,后者作为通用依赖度量能够捕捉高阶依赖。然而,CCA仅考虑线性依赖,在某些应用中可能不足,而互信息在高维场景下往往难以计算/估计。本文提出一种可扩展的信息论泛化方法——最大切片互信息(mSMI),作为两类方法的折中。mSMI定义为高维变量低维投影之间的最大互信息,在高斯情形下退化为CCA。该方法兼具两者优势:既能捕捉数据中的复杂依赖关系,又支持快速计算和基于样本的可扩展估计。我们证明mSMI保留了香农互信息的优良结构性质(如变分形式和独立性判别能力),随后研究其统计估计方法,提出一种高效可计算的神经估计器,并给出严格的非渐近误差界。实验表明,mSMI在独立性检验、多视角表示学习、算法公平性和生成建模等任务中具有实用价值。观察发现,mSMI在几乎不增加计算开销的情况下,始终优于现有方法。