Generalization error bounds from learning theory provide statistical guarantees on how well an algorithm will perform on previously unseen data. In this paper, we characterize the impact of data non-IIDness due to censored feedback (a.k.a. selective labeling bias) on such bounds. We first derive an extension of the well-known Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which characterizes the gap between empirical and theoretical CDFs given IID data, to problems with non-IID data due to censored feedback. We then use this CDF error bound to provide a bound on the generalization error guarantees of a classifier trained on such non-IID data. We show that existing generalization error bounds (which do not account for censored feedback) fail to correctly capture the model's generalization guarantees, verifying the need for our bounds. We further analyze the effectiveness of (pure and bounded) exploration techniques, proposed by recent literature as a way to alleviate censored feedback, in improving our error bounds. Together, our findings illustrate how a decision maker should account for the trade-off between strengthening the generalization guarantees of an algorithm and the costs incurred in data collection when future data availability is limited by censored feedback.
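As background for the extension described above, the classical (IID) DKW inequality bounds the sup-norm gap between the empirical CDF F_n and the true CDF F: P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2), so a confidence level 1 - alpha yields a band half-width eps = sqrt(log(2/alpha) / (2n)). The following is a minimal sketch (not the paper's extended bound) illustrating this IID band on synthetic standard-normal data; the variable names and the choice of distribution are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt, log

# Classical IID DKW inequality: P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2).
# Solving 2 exp(-2 n eps^2) = alpha gives the confidence-band half-width below.
rng = np.random.default_rng(0)
n, alpha = 2000, 0.05
samples = rng.standard_normal(n)

# DKW band half-width at confidence level 1 - alpha
eps = sqrt(log(2 / alpha) / (2 * n))

# Empirical CDF at the sorted sample points vs. the true standard-normal CDF
# (expressed via the error function).
xs = np.sort(samples)
F_emp = np.arange(1, n + 1) / n
F_true = np.array([0.5 * (1 + erf(x / sqrt(2))) for x in xs])
sup_dev = np.abs(F_emp - F_true).max()

print(f"DKW half-width eps = {eps:.4f}, observed sup deviation = {sup_dev:.4f}")
```

With IID data the observed deviation falls within the DKW band with probability at least 1 - alpha; censored feedback breaks exactly this IID assumption, which is why the paper derives an extended version of the inequality.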