A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address this problem, we devise the first \emph{active sequential two-sample testing framework}, which queries sample labels not only sequentially but also \emph{actively}. Our test statistic is a likelihood ratio in which one likelihood is obtained by maximizing over all class priors and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict which (unlabelled) features have a high dependency on labels; labeling these ``high-dependency'' features increases the power of the proposed testing framework. In theory, we prove that our framework produces an \emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in both asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it in several experiments; results on synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error remains under control.
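As a rough illustration of the likelihood-ratio statistic described above, the following Python sketch shows one possible reading for binary labels; it is not the paper's implementation. It assumes that the classifier probability for each label was produced before that label was revealed, and that the $p$-value is taken as the running minimum of the inverse ratio, capped at 1. The function names (`log_max_prior_likelihood`, `log_classifier_likelihood`, `anytime_p_values`) are hypothetical, and the active query step is omitted.

```python
import math


def log_max_prior_likelihood(labels):
    """Log-likelihood of binary labels under the best-fitting class prior.

    Under the null hypothesis the label does not depend on the feature,
    so the likelihood is maximized at the empirical label frequency.
    """
    n = len(labels)
    k = sum(labels)
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log((n - k) / n)
    return ll


def log_classifier_likelihood(probs, labels, eps=1e-12):
    """Log-likelihood of the labels under predicted P(label = 1 | feature).

    Assumption: probs[i] was predicted before labels[i] was observed.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        q = p if y == 1 else 1.0 - p
        total += math.log(max(q, eps))  # clamp to avoid log(0)
    return total


def anytime_p_values(probs, labels):
    """Running p-value sequence after each queried label (illustrative)."""
    p_values = []
    running = 1.0
    for n in range(1, len(labels) + 1):
        log_ratio = (log_classifier_likelihood(probs[:n], labels[:n])
                     - log_max_prior_likelihood(labels[:n]))
        # Inverse likelihood ratio, capped at 1 (also avoids overflow).
        running = min(running, math.exp(min(0.0, -log_ratio)))
        p_values.append(running)
    return p_values
```

For example, `anytime_p_values([0.9, 0.1, 0.9, 0.1], [1, 0, 1, 0])` yields a non-increasing sequence, because the hypothetical classifier predicts the labels better than any fixed class prior can; with uninformative predictions of 0.5 throughout, the sequence stays at 1.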