In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round, upon observing the covariates of a sample, one must decide the probability of querying its ground-truth label. If the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. Exploring different values of the mixing parameter, we observe an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, that is, when the uncertainty-based component has little influence. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on the width of its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to choose the query probability so as to control this bound, and the query probability is chosen obliviously to the current covariates, it converges to the maximum value allowed by the constraint. We also conduct simulations that corroborate these theoretical findings.
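To make the setup concrete, the following is a minimal sketch of one way such a scheme could look. It is not the exact procedure of the prior work or of this paper: the abstract does not specify the estimator, so the sketch assumes a standard inverse-probability-weighted correction of the model prediction and a convex mixture for the query probability. The names `mixed_query_probability`, `active_mean_estimate`, `predict`, `suggest`, `constant_prob`, and `weight` are hypothetical and chosen only for illustration.

```python
import random


def mixed_query_probability(suggestion, constant_prob, weight):
    """Convex mixture of an uncertainty-based suggestion and a constant probability.

    weight = 1 ignores the uncertainty-based suggestion entirely;
    weight = 0 uses only the suggestion. (Assumed form of the mixing rule.)
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * suggestion + weight * constant_prob


def active_mean_estimate(samples, predict, suggest, constant_prob, weight, rng):
    """Sequentially estimate the mean label from (covariate, label) pairs.

    Each round contributes the model prediction; when the label is queried,
    the prediction error is added back, reweighted by 1 / query probability
    so that the contribution stays unbiased. (Assumed estimator form.)
    Returns the estimate and the number of labels queried.
    """
    total = 0.0
    n_rounds = 0
    n_queries = 0
    for x, y in samples:
        pred = predict(x)
        pi = mixed_query_probability(suggest(x), constant_prob, weight)
        # Query the label with probability pi; if pi == 0 this never fires,
        # so the division below is never reached with a zero denominator.
        if rng.random() < pi:
            total += pred + (y - pred) / pi
            n_queries += 1
        else:
            total += pred
        n_rounds += 1
    return total / n_rounds, n_queries
```

For example, calling `active_mean_estimate(data, predict, suggest, constant_prob=0.2, weight=0.9, rng=random.Random(0))` with user-supplied `data`, `predict`, and `suggest` would place most of the weight on the constant probability, which is the regime the empirical observation above points to. In that regime the query probability depends only weakly on the covariates.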