Corruption is frequently observed in collected data and has been studied extensively in machine learning under a variety of corruption models. Even so, how these models relate to one another is poorly understood, and a unified view of corruptions and their consequences for learning is still lacking. In this work, we formally analyze corruption models at the distribution level through a general, exhaustive framework based on Markov kernels. We highlight the existence of intricate joint and dependent corruptions of both labels and attributes, which existing research rarely addresses. Further, we show how these corruptions affect standard supervised learning by analyzing the resulting changes in Bayes risk. Our findings offer qualitative insights into the consequences of "more complex" corruptions for the learning problem and provide a foundation for future quantitative comparisons. Applications of the framework include corruption-corrected learning, a subcase of which we study in this paper by theoretically analyzing loss correction with respect to different corruption instances.
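To make the kernel view concrete, the following is a minimal sketch of the simplest special case only: a label-only corruption on a finite space, where the Markov kernel reduces to a row-stochastic matrix. It is not the general framework of the paper. The function names, the toy distribution, and the flip rate `rho` are all illustrative assumptions. The sketch pushes a clean class posterior through the kernel and compares the Bayes risk under 0-1 loss before and after corruption.

```python
# Illustrative sketch: finite, label-only corruption (an assumed special case).

def corrupt_posterior(eta, T):
    """Push a clean class posterior through a label-noise kernel.

    eta[y]  = P(Y = y | x) for the clean label.
    T[y][z] = P(corrupted label = z | clean label = y); each row sums to 1.
    """
    k = len(eta)
    return [sum(eta[y] * T[y][z] for y in range(k)) for z in range(k)]


def bayes_risk_01(posteriors, weights):
    """Bayes risk under 0-1 loss on a finite attribute space:
    sum over x of P(X = x) * (1 - max_y P(y | x))."""
    return sum(w * (1.0 - max(p)) for p, w in zip(posteriors, weights))


# Hypothetical toy problem: three attribute values, binary labels.
weights = [0.5, 0.3, 0.2]                      # P(X = x)
clean = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]   # P(Y | X = x)

# Symmetric label noise with an assumed flip rate of 0.2.
rho = 0.2
T = [[1 - rho, rho], [rho, 1 - rho]]

corrupted = [corrupt_posterior(eta, T) for eta in clean]

clean_risk = bayes_risk_01(clean, weights)
corrupted_risk = bayes_risk_01(corrupted, weights)
```

In this toy case the Bayes risk rises from about 0.19 to about 0.314, matching the closed form `rho + (1 - 2 * rho) * clean_risk` for symmetric binary label noise with `rho < 0.5`. Joint or attribute-dependent corruptions would need a kernel acting on the pair of attributes and labels rather than a single matrix on labels.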