We consider the variable selection problem for two-sample tests, aiming to select the most informative variables for determining whether two collections of samples follow the same distribution. To address this, we propose a novel framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a subset of variables of pre-specified size that maximizes the variance-regularized kernel MMD statistic. We focus on three commonly used types of kernels: linear, quadratic, and Gaussian. From a computational perspective, we derive mixed-integer programming formulations and propose exact and approximation algorithms with performance guarantees to solve them. From a statistical viewpoint, we derive the convergence rate of the testing power of our framework under appropriate conditions. These results show that the sample size requirements for the three kernels depend crucially on the number of selected variables, rather than on the data dimension. Experimental results on synthetic and real datasets show that our method outperforms other variable selection frameworks, particularly in high-dimensional settings.
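To make the selection problem concrete, the following is a minimal sketch of a deliberately simplified special case, not the framework proposed here. It assumes a linear kernel restricted to the selected coordinates, k_S(x, y) = sum over j in S of x_j y_j, and it omits the variance regularization entirely. Under those assumptions the biased MMD^2 estimate decomposes into per-variable squared mean differences, so the best subset of size d is simply the d variables with the largest scores, and no mixed-integer solver is needed. All function names and the synthetic data below are hypothetical.

```python
import random


def column_means(rows):
    """Mean of each variable (column) over the samples (rows)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def per_variable_scores(x, y):
    """Contribution of each variable to the biased linear-kernel MMD^2.

    With k(x, y) = <x, y>, the biased MMD^2 estimate equals the squared
    Euclidean distance between the two sample means, which is a sum of
    per-variable squared mean differences.
    """
    return [(mx - my) ** 2 for mx, my in zip(column_means(x), column_means(y))]


def select_variables(x, y, d):
    """Pick the d variables maximizing the unregularized linear-kernel MMD^2.

    Assumption: no variance regularization, so the objective is separable
    and top-d selection is exactly optimal for this simplified objective.
    """
    scores = per_variable_scores(x, y)
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:d]
    return sorted(top), sum(scores[j] for j in top)


# Synthetic example: 50 variables, only three of which differ in mean.
random.seed(0)
n, D, d = 200, 50, 3
shifted = {4, 17, 30}
x = [[random.gauss(0, 1) for _ in range(D)] for _ in range(n)]
y = [[random.gauss(0, 1) + (1.5 if j in shifted else 0.0) for j in range(D)]
     for _ in range(n)]

selected, mmd2 = select_variables(x, y, d)
print("selected variables:", selected)
print("restricted MMD^2 estimate:", mmd2)
```

Once the variance regularization is included, or the kernel is quadratic or Gaussian, the objective is presumably no longer separable across variables in this way, which is consistent with the need for the mixed-integer formulations and the exact and approximation algorithms mentioned above.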