We study the problems of sequential nonparametric two-sample and independence testing. Sequential tests process data online and allow using observed data to decide whether to stop and reject the null hypothesis or to collect more data, while maintaining type I error control. We build upon the principle of (nonparametric) testing by betting, where a gambler places bets on future observations and their wealth measures evidence against the null hypothesis. While recently developed kernel-based betting strategies often work well on simple distributions, selecting a suitable kernel for high-dimensional or structured data, such as images, is often nontrivial. To address this drawback, we design prediction-based betting strategies that rely on the following fact: if a sequentially updated predictor starts to consistently determine (a) which distribution an instance is drawn from, or (b) whether an instance is drawn from the joint distribution or the product of the marginal distributions (the latter produced by external randomization), it provides evidence against the two-sample or independence nulls respectively. We empirically demonstrate the superiority of our tests over kernel-based approaches under structured settings. Our tests can be applied beyond the case of independent and identically distributed data, remaining valid and powerful even when the data distribution drifts over time.
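To make the betting idea concrete, here is a minimal sketch of a prediction-based sequential two-sample test. It is an illustration under stated assumptions, not the authors' implementation. It assumes observations arrive in pairs (one draw from each distribution per round), a simple online logistic predictor, and a constant betting fraction. The function name `predictive_two_sample_test` and the parameters `bet_fraction` and `lr` are hypothetical. The predictor outputs a score in [-1, 1], and the payoff for a pair is half the difference of the scores. Under the null this payoff has conditional mean zero, so the wealth is a nonnegative martingale, and rejecting once it reaches 1/alpha keeps the type I error at most alpha by Ville's inequality.

```python
import numpy as np


def predictive_two_sample_test(xs, ys, alpha=0.05, bet_fraction=0.5, lr=0.1):
    """Sequential two-sample test by betting (illustrative sketch).

    xs, ys: arrays of shape (n, d), processed one pair per round.
    Returns (rejected, rounds_used, final_wealth).
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = min(len(xs), len(ys))
    weights = np.zeros(xs.shape[1] + 1)  # linear predictor with a bias term
    log_wealth = 0.0                     # wealth starts at 1
    threshold = np.log(1.0 / alpha)

    for t in range(1, n + 1):
        fx = np.append(xs[t - 1], 1.0)
        fy = np.append(ys[t - 1], 1.0)

        # Scores in (-1, 1), computed BEFORE the predictor sees this pair,
        # so the bet is predictable with respect to the past.
        gx = np.tanh(0.5 * (weights @ fx))
        gy = np.tanh(0.5 * (weights @ fy))

        # Payoff in [-1, 1]; its conditional mean is zero under the null.
        payoff = 0.5 * (gx - gy)
        log_wealth += np.log1p(bet_fraction * payoff)

        if log_wealth >= threshold:
            return True, t, float(np.exp(log_wealth))

        # Online logistic-regression step: label +1 for x, label -1 for y.
        weights += lr * 0.5 * ((1.0 - gx) * fx - (1.0 + gy) * fy)

    return False, n, float(np.exp(log_wealth))
```

Calling `predictive_two_sample_test(xs, ys)` on two streams of the same length returns whether the null was rejected, the round at which the test stopped, and the wealth at that point. A constant `bet_fraction` is chosen here only for brevity. Any betting fraction in [0, 1) that depends only on past data preserves validity, and adaptive choices typically grow wealth faster under the alternative.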