We consider the variable selection problem for two-sample tests: selecting the most informative features for distinguishing samples from two groups. We propose a kernel maximum mean discrepancy (MMD) framework for this problem and derive equivalent mixed-integer programming formulations for linear, quadratic, and Gaussian kernel functions. The framework combines computational efficiency with favorable statistical properties. (i) For the linear kernel, we provide a closed-form solution. For the quadratic kernel, the problem is NP-hard, yet we provide an exact mixed-integer semidefinite programming formulation, which in turn motivates the development of both exact and approximation algorithms. For the Gaussian kernel, we propose a convex-concave procedure that finds critical points. (ii) We provide non-asymptotic uncertainty quantification for the proposed formulation under both the null and the alternative hypotheses. Experimental results demonstrate good performance of our framework.
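To give some intuition for why the linear kernel case can admit a closed-form solution, here is a minimal sketch under a simplifying assumption: features are selected by a binary mask, and the linear kernel is evaluated only on the selected coordinates. Under that assumption the biased squared MMD reduces to the squared Euclidean distance between the two sample means on the selected coordinates, so keeping the d coordinates with the largest squared mean difference maximizes it. This may differ from the exact formulation in the paper. The function names `linear_mmd_sq` and `select_top_features` and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np


def linear_mmd_sq(x, y, selected=None):
    """Biased squared MMD under a linear kernel, restricted to `selected` columns.

    For k(a, b) = a . b, the biased estimate equals ||mean(x) - mean(y)||^2.
    """
    if selected is not None:
        x, y = x[:, selected], y[:, selected]
    diff = x.mean(axis=0) - y.mean(axis=0)
    return float(diff @ diff)


def select_top_features(x, y, d):
    """Return (sorted) indices of the d coordinates with the largest
    squared difference in sample means. Assumes a binary selection mask."""
    scores = (x.mean(axis=0) - y.mean(axis=0)) ** 2
    top = np.argsort(scores)[::-1][:d]
    return np.sort(top)


# Synthetic example: the two groups differ only in features 2 and 7.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))
y = rng.normal(size=(200, 10))
y[:, [2, 7]] += 1.5

chosen = select_top_features(x, y, d=2)
print("selected features:", chosen)
print("MMD^2 on selected features:", linear_mmd_sq(x, y, chosen))
```

Because the objective decomposes into a sum of per-coordinate terms in this simplified setting, a sort suffices. The quadratic and Gaussian kernels couple the coordinates, which is consistent with the abstract's need for mixed-integer and convex-concave methods in those cases.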