We consider the variable selection problem for two-sample tests, where the goal is to select the variables that are most informative for distinguishing samples from two groups. To solve this problem, we propose a framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a set of variables of pre-specified size that maximizes the variance-regularized MMD statistic. This formulation also corresponds to minimizing the asymptotic type-II error while controlling the type-I error, as studied in the literature. We present mixed-integer programming formulations and develop exact and approximation algorithms with performance guarantees for different choices of kernel functions. Furthermore, we provide an analysis of the statistical testing power of the proposed framework. Experimental results on synthetic and real datasets demonstrate the superior performance of our approach.
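To make the selection objective concrete, the following is a minimal sketch and not the algorithms of the paper. It assumes a Gaussian kernel, equal sample sizes in the two groups, a simple plug-in variance estimate, and exhaustive enumeration of subsets in place of the mixed-integer programming formulations. The function names, the `bandwidth` parameter, and the regularization constant `lam` are all hypothetical choices made for illustration.

```python
from itertools import combinations

import numpy as np


def gaussian_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def regularized_mmd(X, Y, subset, bandwidth=1.0, lam=1e-3):
    """MMD^2 estimate divided by sqrt(variance estimate + lam).

    Assumption: X and Y have the same number of rows, and the variance
    is a plug-in estimate chosen for this sketch.
    """
    idx = list(subset)
    Xs, Ys = X[:, idx], Y[:, idx]
    n = Xs.shape[0]
    Kxy = gaussian_kernel(Xs, Ys, bandwidth)
    # H[i, j] = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    H = (
        gaussian_kernel(Xs, Xs, bandwidth)
        + gaussian_kernel(Ys, Ys, bandwidth)
        - Kxy
        - Kxy.T
    )
    # U-statistic estimate of MMD^2 (off-diagonal average of H).
    mmd2 = (H.sum() - np.trace(H)) / (n * (n - 1))
    # Plug-in estimate of the variance of the MMD^2 estimate.
    var = 4.0 / n**3 * np.sum(H.sum(axis=1) ** 2) - 4.0 / n**4 * H.sum() ** 2
    return mmd2 / np.sqrt(max(var, 0.0) + lam)


def select_variables(X, Y, k, bandwidth=1.0, lam=1e-3):
    """Brute-force search over all subsets of exactly k variables."""
    d = X.shape[1]
    return max(
        combinations(range(d), k),
        key=lambda s: regularized_mmd(X, Y, s, bandwidth, lam),
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    Y = rng.normal(size=(100, 5))
    Y[:, 2] += 2.0  # only variable 2 differs between the two groups
    print(select_variables(X, Y, k=1))
```

Exhaustive enumeration costs on the order of "d choose k" objective evaluations, so it is only practical for small problems; this cost is what motivates the exact and approximation algorithms mentioned in the abstract.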