For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $\alpha$ for the success criterion produces statistical confidence at level $1 - \alpha$. For multiple criteria, a Bonferroni correction that partitions $\alpha$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points.
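As a rough illustration of the Bonferroni partition described above, the following sketch splits $\alpha$ evenly over every test that will be run and compares each $p$-value against the resulting threshold. The even split, the function names `per_test_threshold` and `criteria_pass`, and the treatment of each (criterion, decision point) pair as a separate test are assumptions made for this example; they are not taken from the method developed in this paper.

```python
from typing import List


def per_test_threshold(alpha: float, num_criteria: int,
                       num_decision_points: int = 1) -> float:
    """Evenly partition alpha over all tests (assumed even split).

    Each (criterion, decision point) pair is counted as one test, so the
    union bound keeps the total false-success probability at most alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if num_criteria < 1 or num_decision_points < 1:
        raise ValueError("counts must be positive integers")
    return alpha / (num_criteria * num_decision_points)


def criteria_pass(p_values: List[float], alpha: float,
                  num_decision_points: int = 1) -> List[bool]:
    """Report, per criterion, whether its p-value beats the threshold."""
    threshold = per_test_threshold(alpha, len(p_values), num_decision_points)
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    alpha = 0.05
    # Fixed-horizon test with three criteria: each needs p < alpha / 3.
    print(per_test_threshold(alpha, 3))
    # Same criteria with five possible stopping points: p < alpha / 15.
    print(per_test_threshold(alpha, 3, 5))
    print(criteria_pass([0.01, 0.02, 0.001], alpha))
```

With $\alpha = 0.05$ and three criteria, each criterion needs $p < 0.05/3 \approx 0.0167$. Allowing five decision points under the same naive partition tightens this to $p < 0.05/15 \approx 0.0033$, which shows how quickly the requirements become strict.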