Kernel techniques are among the most popular and powerful approaches in data science. Among the key features that make kernels ubiquitous are (i) the number of domains they have been designed for, (ii) the Hilbert structure of the function class associated with kernels, which facilitates their statistical analysis, and (iii) their ability to represent probability distributions without loss of information. These properties give rise to the immense success of the Hilbert-Schmidt independence criterion (HSIC), which is able to capture joint independence of random variables under mild conditions and permits closed-form estimators with quadratic computational complexity (w.r.t. the sample size). To alleviate the quadratic computational bottleneck in large-scale applications, multiple HSIC approximations have been proposed; however, these estimators are restricted to $M=2$ random variables, do not extend naturally to the $M\ge 2$ case, and lack theoretical guarantees. In this work, we propose an alternative Nystr\"om-based HSIC estimator which handles the $M\ge 2$ case, prove its consistency, and demonstrate its applicability in multiple contexts, including synthetic examples, dependency testing of media annotations, and causal discovery.
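To make the quadratic cost concrete, the following is a minimal sketch of the classical closed-form (V-statistic) HSIC estimator for $M=2$ variables, $\frac{1}{n^2}\operatorname{tr}(KHLH)$ with centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. It is not the Nystr\"om-based estimator proposed in this work. The Gaussian kernel, the shared bandwidth, and the function names `gaussian_gram` and `hsic_v_statistic` are assumptions chosen for illustration only.

```python
import numpy as np


def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a - b||^2 / (2 * bandwidth^2))."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)  # treat 1-D input as n samples of dimension 1
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def hsic_v_statistic(x, y, bandwidth=1.0):
    """Biased (V-statistic) HSIC estimate for two variables.

    Time and memory are quadratic in the sample size n, because both
    n x n Gram matrices are formed explicitly.
    """
    n = len(x)
    K = gaussian_gram(x, bandwidth)
    L = gaussian_gram(y, bandwidth)
    # H K H computed by subtracting row and column means (avoids O(n^3) products).
    K_centered = K - K.mean(axis=0) - K.mean(axis=1)[:, None] + K.mean()
    # tr(K H L H) = sum_ij (H K H)_ij * L_ij, since L is symmetric.
    return np.sum(K_centered * L) / n**2
```

On paired samples `x` and `y` of equal length, `hsic_v_statistic(x, y)` returns a non-negative scalar (up to round-off) that tends to be larger when the two samples are dependent. Forming the two $n \times n$ Gram matrices is the quadratic bottleneck that approximations such as Nystr\"om-based schemes aim to avoid.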