Two-sample testing decides whether two datasets are generated from the same distribution. This paper studies variable selection for two-sample testing, that is, the task of identifying the variables (or dimensions) responsible for the discrepancies between the two distributions. This task is relevant to many problems in pattern analysis and machine learning, such as dataset shift adaptation, causal inference and model validation. Our approach builds on a two-sample test based on the Maximum Mean Discrepancy (MMD). We optimise the Automatic Relevance Detection (ARD) weights defined for individual variables to maximise the power of the MMD-based test. For this optimisation, we introduce sparse regularisation and propose two methods for selecting an appropriate regularisation parameter. One method determines the regularisation parameter in a data-driven way, and the other aggregates the results obtained with different regularisation parameters. We confirm the validity of the proposed methods by systematic comparisons with baseline methods, and demonstrate their usefulness in exploratory analysis of high-dimensional traffic simulation data. Preliminary theoretical analyses are also provided, including a rigorous definition of variable selection for two-sample testing.
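As a rough illustration of the idea of per-variable ARD weights inside an MMD statistic, the following minimal sketch computes an unbiased MMD² estimate with an ARD-weighted Gaussian kernel and adds an L1 penalty on the weights. It is not the paper's implementation: the function names (`ard_gaussian_kernel`, `mmd2_unbiased`, `regularised_objective`), the exact kernel parameterisation, the use of the plain MMD² estimate as a stand-in for the test-power criterion, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def ard_gaussian_kernel(A, B, weights):
    """Gaussian kernel with one ARD weight per variable (assumed form):
    k(a, b) = exp(-sum_d weights[d]**2 * (a_d - b_d)**2)."""
    Aw = A * weights
    Bw = B * weights
    sq = (Aw ** 2).sum(1)[:, None] + (Bw ** 2).sum(1)[None, :] - 2.0 * Aw @ Bw.T
    return np.exp(-np.maximum(sq, 0.0))  # clamp tiny negative round-off


def mmd2_unbiased(X, Y, weights):
    """Unbiased estimate of the squared MMD between samples X and Y."""
    n, m = len(X), len(Y)
    Kxx = ard_gaussian_kernel(X, X, weights)
    Kyy = ard_gaussian_kernel(Y, Y, weights)
    Kxy = ard_gaussian_kernel(X, Y, weights)
    # Drop diagonal terms in the within-sample averages.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()


def regularised_objective(X, Y, weights, lam):
    """Objective to minimise: negative MMD^2 estimate plus an L1 penalty.
    Simplification: MMD^2 is used here in place of a test-power criterion."""
    return -mmd2_unbiased(X, Y, weights) + lam * np.abs(weights).sum()


# Synthetic example (assumed): only variable 0 differs between the samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 5))
Y[:, 0] += 2.0

w_shift = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # weight only on the shifted variable
w_noise = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # weight only on an unshifted variable

print("MMD^2, weight on variable 0:", mmd2_unbiased(X, Y, w_shift))
print("MMD^2, weight on variable 1:", mmd2_unbiased(X, Y, w_noise))
```

In this toy setting, putting the weight on the shifted variable should give a clearly larger MMD² estimate than putting it on an unshifted one, which is the signal a sparse optimisation over the weights would try to exploit. The regularisation parameter `lam` controls how strongly non-zero weights are penalised.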