Independence testing is a classical statistical problem that has been extensively studied in the batch setting, where the sample size is fixed before collecting data. However, practitioners often prefer procedures that adapt to the complexity of the problem at hand instead of setting the sample size in advance. Ideally, such procedures should (a) stop earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. Classical batch tests are not tailored for streaming data: valid inference after data peeking requires correcting for multiple testing, which results in low power. Following the principle of testing by betting, we design sequential kernelized independence tests that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g., the Hilbert-Schmidt independence criterion. Our test is also valid under non-i.i.d., time-varying settings. We demonstrate the power of our approaches on both simulated and real data.
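
To make the testing-by-betting idea concrete, the following is a minimal Python sketch under stated assumptions; it is an illustration, not the procedure proposed in the paper. It assumes scalar observations, a Gaussian kernel with a fixed bandwidth, a plug-in HSIC-style witness function estimated from past data, and a constant betting fraction (practical variants would likely tune the bet adaptively). All function and parameter names (`gaussian_kernel`, `witness`, `sequential_independence_test`, `bet`) are hypothetical. The bettor starts with unit wealth, bets on each new pair of observations, and rejects independence once the wealth reaches 1/alpha. Under the null, swapping the Y values within a pair leaves the distribution unchanged and flips the sign of the payoff, so the wealth is a nonnegative martingale and Ville's inequality bounds the false alarm rate by alpha.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between scalars or arrays (broadcasts elementwise)."""
    return np.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def witness(x, y, past_x, past_y):
    """Plug-in HSIC-style witness evaluated at (x, y).

    Empirical embedding of the joint distribution minus the product of the
    empirical embeddings of the marginals, both estimated from past data.
    """
    kx = gaussian_kernel(x, past_x)
    ky = gaussian_kernel(y, past_y)
    return np.mean(kx * ky) - np.mean(kx) * np.mean(ky)


def sequential_independence_test(xs, ys, alpha=0.05, bet=0.5):
    """Return (rejected, samples_used, final_wealth).

    Assumption: `bet` is a constant fraction in [0, 1]; this is a
    simplification chosen for readability.
    """
    wealth = 1.0
    n_pairs = min(len(xs), len(ys)) // 2
    # The first pair only serves as "past" data for the witness estimate.
    for t in range(1, n_pairs):
        past_x, past_y = xs[: 2 * t], ys[: 2 * t]
        x1, y1 = xs[2 * t], ys[2 * t]
        x2, y2 = xs[2 * t + 1], ys[2 * t + 1]
        # Compare the witness on the observed pairs with the witness on
        # pairs whose Y values are swapped (a draw from the product law).
        raw = 0.5 * (
            witness(x1, y1, past_x, past_y)
            + witness(x2, y2, past_x, past_y)
            - witness(x1, y2, past_x, past_y)
            - witness(x2, y1, past_x, past_y)
        )
        payoff = np.tanh(raw)  # odd squashing keeps the payoff in (-1, 1)
        wealth *= 1.0 + bet * payoff  # multiplicative wealth update
        if wealth >= 1.0 / alpha:
            return True, 2 * (t + 1), wealth  # stop early and reject
    return False, 2 * n_pairs, wealth


# Usage on a synthetic dependent stream (Y is a noisy copy of X).
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = x + 0.1 * rng.standard_normal(1000)
rejected, n_used, wealth = sequential_independence_test(x, y)
print(rejected, n_used, wealth)
```

Because the decision rule is checked after every pair, the stream can be monitored continuously without a multiple-testing correction, and strongly dependent data would be expected to trigger a rejection after fewer samples than weakly dependent data.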