The problem of corrupted data, missing features, or missing modalities continues to plague the modern machine learning landscape. To address this issue, a class of regularization methods that enforce consistency between imputed and fully observed data has emerged as a promising approach for improving model generalization, particularly in partially observed settings. We refer to this class of methods as Measure Consistency Regularization (MCR). Despite its empirical success in applications such as image inpainting, data imputation, and semi-supervised learning, a fundamental understanding of the theoretical underpinnings of MCR remains limited. This paper bridges this gap by offering theoretical insights into why, when, and how MCR enhances imputation quality under partial observability, viewed through the lens of neural network distance. Our theoretical analysis identifies the term responsible for MCR's generalization advantage and extends to the imperfect training regime, demonstrating that this advantage is not always guaranteed. Guided by these insights, we propose a novel training protocol that monitors the duality gap to determine an early stopping point that preserves the generalization benefit. We then provide detailed empirical evidence that supports our theoretical claims and demonstrates the effectiveness and accuracy of the proposed stopping condition. Finally, a set of real-world data simulations illustrates the versatility of MCR across model architectures designed for different data sources.