Two-sample testing asks whether the distributions generating two samples are identical. We pose the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address this problem, we devise the first \emph{active sequential two-sample testing framework}, which queries sample labels not only sequentially but also \emph{actively}. Our test statistic is a likelihood ratio in which one likelihood is obtained by maximization over all class priors and the other is given by a classification model. The classification model is adaptively updated and then used to guide an active query scheme, called bimodal query, that labels sample features in the regions with high dependency between the feature variables and the label variables. Our theoretical contributions include a proof that the framework produces an \emph{anytime-valid} $p$-value, and a result showing that, under reachable conditions and a mild assumption, the framework asymptotically generates a minimum normalized log-likelihood ratio statistic that a passive query scheme can only achieve when the feature variable and the label variable have the highest dependence. Lastly, we provide a \emph{query-switching (QS)} algorithm that decides when to switch from passive query to active query and adapts bimodal query to increase the power of our test. Extensive experiments support our theoretical contributions and demonstrate the effectiveness of QS.
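As a rough illustration of the likelihood-ratio statistic described above, the following Python sketch computes a running $p$-value from a stream of queried labels. It is a minimal sketch under stated assumptions, not the paper's implementation: labels are assumed binary, the null likelihood is assumed to sit in the numerator and the classifier likelihood in the denominator, and each classifier probability is assumed to have been produced before the corresponding label was revealed. The names `sequential_p_values`, `probs_of_one`, and `eps` are hypothetical, and neither bimodal query nor QS is implemented here.

```python
import math
from typing import List, Sequence


def _xlogy(count: int, total: int) -> float:
    """Return count * log(count / total), using the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(count / total)


def sequential_p_values(
    labels: Sequence[int],
    probs_of_one: Sequence[float],
    eps: float = 1e-12,  # assumed clipping constant to keep log() finite
) -> List[float]:
    """Running likelihood-ratio p-values after each queried label.

    labels[i] is the queried binary label (0 or 1) of the i-th sample.
    probs_of_one[i] is the classifier's probability that labels[i] == 1,
    assumed to be computed before labels[i] was observed.
    """
    if len(labels) != len(probs_of_one):
        raise ValueError("labels and probs_of_one must have the same length")

    p_values: List[float] = []
    ones = 0          # number of labels equal to 1 seen so far
    log_model = 0.0   # log-likelihood of the labels under the classifier

    for n, (y, q) in enumerate(zip(labels, probs_of_one), start=1):
        if y not in (0, 1):
            raise ValueError("labels must be 0 or 1")
        q = min(max(q, eps), 1.0 - eps)
        ones += y
        log_model += math.log(q if y == 1 else 1.0 - q)

        # Log-likelihood maximized over all class priors: the maximizer is
        # the empirical frequency ones / n of label 1.
        log_null = _xlogy(ones, n) + _xlogy(n - ones, n)

        # Ratio of the maximized null likelihood to the model likelihood,
        # capped at 1 so that it reads as a p-value.
        log_ratio = log_null - log_model
        p_values.append(1.0 if log_ratio >= 0.0 else math.exp(log_ratio))

    return p_values
```

For example, `sequential_p_values([1, 1, 0, 0], [0.9, 0.9, 0.1, 0.1])` yields values that shrink as the classifier keeps predicting the labels better than any fixed class prior could, whereas an uninformative classifier that always outputs 0.5 keeps the value at 1. A test built on this sketch would stop and reject the null once the running value drops to a chosen significance level.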