We study the problems of sequential nonparametric two-sample and independence testing. Sequential tests process data online and use the data observed so far to decide whether to stop and reject the null hypothesis or to collect more data, all while maintaining type I error control. We build upon the principle of (nonparametric) testing by betting, where a gambler places bets on future observations and the gambler's wealth measures the evidence against the null hypothesis. While recently developed kernel-based betting strategies often work well on simple distributions, selecting a suitable kernel for high-dimensional or structured data, such as text and images, is often nontrivial. To address this drawback, we design prediction-based betting strategies that rely on the following fact: if a sequentially updated predictor starts to consistently identify (a) which distribution an instance is drawn from, or (b) whether an instance is drawn from the joint distribution or the product of the marginal distributions (the latter produced by external randomization), it provides evidence against the two-sample or independence null, respectively. We empirically demonstrate the superiority of our tests over kernel-based approaches under structured settings. Our tests can be applied beyond the case of independent and identically distributed data, remaining valid and powerful even when the data distribution drifts over time.
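The prediction-based betting idea for the two-sample case can be illustrated with a minimal sketch. It is a simplified illustration under stated assumptions, not the authors' exact construction: the function name `sequential_two_sample_test`, the online logistic-regression predictor, the payoff `0.5 * (g(x) - g(y))`, the fixed betting fraction `bet`, and the learning rate `lr` are all hypothetical choices made here for brevity. The sketch assumes one observation from each distribution arrives per round. The predictor is updated only after each bet is placed, so under the null the payoff has mean zero and the wealth is a nonnegative martingale; rejecting once the wealth reaches `1 / alpha` then controls the type I error at level `alpha` by Ville's inequality.

```python
import numpy as np


def sequential_two_sample_test(xs, ys, alpha=0.05, bet=0.5, lr=0.1):
    """Betting-style sequential two-sample test driven by an online classifier.

    xs, ys: arrays of shape (n, d); row t holds the pair observed at round t.
    bet:    fixed betting fraction in [0, 1) (an assumption of this sketch).
    Returns (stopping_round or None, wealth trajectory up to stopping).
    """
    n, d = xs.shape
    w, b = np.zeros(d), 0.0               # linear classifier, updated online
    wealth, path = 1.0, []

    for t in range(n):
        x, y = xs[t], ys[t]

        # Bet first, using only a predictor fitted on earlier rounds.
        gx = np.tanh(0.5 * (w @ x + b))   # in [-1, 1]; > 0 means "looks like P"
        gy = np.tanh(0.5 * (w @ y + b))
        payoff = 0.5 * (gx - gy)          # in [-1, 1], mean zero when P = Q
        wealth *= 1.0 + bet * payoff      # stays positive because bet < 1
        path.append(wealth)
        if wealth >= 1.0 / alpha:         # enough evidence: stop and reject
            return t + 1, np.array(path)

        # Then learn: one logistic-regression SGD step per labelled point.
        for z, label in ((x, 1.0), (y, -1.0)):
            margin = label * (w @ z + b)
            # Gradient of log(1 + exp(-margin)) w.r.t. the score,
            # written with tanh for numerical stability.
            grad = -label * 0.5 * (1.0 - np.tanh(0.5 * margin))
            w -= lr * grad * z
            b -= lr * grad

    return None, np.array(path)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.normal(0.0, 1.0, size=(2000, 5))   # samples from P
    ys = rng.normal(2.0, 1.0, size=(2000, 5))   # samples from Q (mean-shifted)
    tau, path = sequential_two_sample_test(xs, ys)
    print("stopped at round:", tau, "final wealth:", path[-1])
```

A fixed betting fraction keeps the sketch short; adaptive choices of the fraction, and richer predictors such as neural networks for text or images, fit the same loop as long as both are computed from past rounds only.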