We consider variable selection for two-sample tests: the task of selecting the variables that are most informative for distinguishing samples from two groups. To solve this problem, we propose a framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a subset of variables of pre-specified size that maximizes the variance-regularized MMD statistic. This formulation corresponds to minimizing the asymptotic type-II error while controlling the type-I error, as studied in the literature. We present mixed-integer programming formulations and offer exact and approximation algorithms with performance guarantees for linear and quadratic kernel functions. Experimental results demonstrate the superior performance of our framework.
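As an illustration of the selection problem described above, the following is a minimal sketch for the linear-kernel case. It is not the mixed-integer programming approach of the paper: it enumerates every subset of size `k` by brute force, which is only feasible for a small number of variables. The objective form (estimated squared MMD divided by the square root of a plug-in variance estimate plus a regularizer `reg`), the equal-sample-size requirement, and the names `linear_mmd_objective` and `select_variables` are all assumptions made for this example.

```python
from itertools import combinations

import numpy as np


def linear_mmd_objective(X, Y, subset, reg=1e-3):
    """Variance-regularized squared-MMD estimate for a linear kernel.

    Only the columns in `subset` are used. The ratio form of the
    objective is an assumption of this sketch.
    """
    if X.shape[0] != Y.shape[0]:
        raise ValueError("this sketch assumes equal sample sizes")
    idx = list(subset)
    Xs, Ys = X[:, idx], Y[:, idx]
    n = Xs.shape[0]

    # H[i, j] = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    # with the linear kernel k(a, b) = <a, b> on the selected variables.
    Kxy = Xs @ Ys.T
    H = Xs @ Xs.T + Ys @ Ys.T - Kxy - Kxy.T
    np.fill_diagonal(H, 0.0)

    # Unbiased U-statistic estimate of the squared MMD.
    mmd2 = H.sum() / (n * (n - 1))

    # Simple plug-in estimate of the first-order variance term:
    # 4 * Var_i( mean_{j != i} H[i, j] ).
    row_means = H.sum(axis=1) / (n - 1)
    var = 4.0 * np.var(row_means)

    return mmd2 / np.sqrt(var + reg)


def select_variables(X, Y, k, reg=1e-3):
    """Return the size-k tuple of column indices with the largest objective.

    Brute-force enumeration, used here in place of an exact or
    approximation algorithm.
    """
    d = X.shape[1]
    return max(
        combinations(range(d), k),
        key=lambda S: linear_mmd_objective(X, Y, S, reg),
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    Y = rng.normal(size=(200, 6))
    Y[:, [1, 4]] += 2.0  # only variables 1 and 4 differ between groups
    print(select_variables(X, Y, k=2))
```

In the demo, the two groups differ only in the means of variables 1 and 4, so a subset containing both should score highest. Enumeration costs grow combinatorially with the number of variables, which is the motivation for more scalable formulations.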