We demonstrate that learning procedures relying on aggregated labels, e.g., label information distilled from noisy responses, enjoy robustness properties impossible to achieve without such data cleaning. This robustness appears in several ways. In the context of risk consistency -- when one takes the standard approach in machine learning of minimizing a surrogate (typically convex) loss in place of a desired task loss (such as the zero-one misclassification error) -- procedures using label aggregation obtain stronger consistency guarantees than are possible using raw labels. And while classical statistical scenarios of fitting perfectly specified models suggest that incorporating all available information -- modeling the uncertainty in labels -- is statistically efficient, consistency fails for ``standard'' approaches as soon as the loss to be minimized is even slightly misspecified. Yet procedures leveraging aggregated information still converge to optimal classifiers, highlighting how incorporating a fuller view of the data analysis pipeline -- from collection to model fitting to prediction -- can yield a more robust methodology by refining noisy signals.