As deep learning applications become increasingly practical, practitioners inevitably face datasets corrupted by noise from various sources, such as measurement errors, mislabeling, and estimated surrogate inputs/outputs, that can adversely affect the optimization results. A common remedy is to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately responsible for updating the network parameters. Previous studies showed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that uses the Student's t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering, for the first time, a comprehensive framework under the statistical model of the t-distribution. The proposed approach offers several advantages over previous approaches, including fewer hyperparameters and improved robustness and adaptability. This noise-adaptive behavior contributes to AdaTerm's exceptional learning performance, as demonstrated on various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field.
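To make the idea of a noise-resistant first-order moment concrete, the following is a minimal sketch of how Student's t-based weighting can down-weight outlying gradients in a running mean. It is not AdaTerm's update rule, which the abstract does not spell out. It only uses the standard weight from maximum-likelihood estimation of a Student's t location parameter, (ν + d) / (ν + D), where D is a variance-scaled squared distance. The function names, the fixed diagonal variance, and the degrees-of-freedom value are assumptions made for illustration.

```python
def t_weight(grad, mean, var, dof, eps=1e-8):
    """Student's t weight: close to (dof + d) / dof for inliers, near 0 for outliers."""
    d = len(grad)
    # Squared distance from the current mean, scaled per coordinate by the
    # variance estimate (a diagonal covariance is assumed here).
    dist = sum((g - m) ** 2 / (v + eps) for g, m, v in zip(grad, mean, var))
    return (dof + d) / (dof + dist)


def robust_mean_step(grad, mean, var, weight_sum, dof=1.0):
    """One weighted running-mean update; returns (new_mean, new_weight_sum)."""
    w = t_weight(grad, mean, var, dof)
    total = weight_sum + w
    new_mean = [(weight_sum * m + w * g) / total for m, g in zip(mean, grad)]
    return new_mean, total


# Hypothetical state: zero mean, unit variance, accumulated weight of 1.
mean, var, weight_sum = [0.0, 0.0], [1.0, 1.0], 1.0
inlier, outlier = [0.1, -0.1], [50.0, 50.0]

w_in = t_weight(inlier, mean, var, dof=1.0)    # gradient near the mean
w_out = t_weight(outlier, mean, var, dof=1.0)  # gradient far from the mean
after_outlier, _ = robust_mean_step(outlier, mean, var, weight_sum)
```

In this sketch the outlying gradient receives a weight several orders of magnitude smaller than the inlier, so the running mean barely moves. An unweighted average of the same two values would jump to 25 in each coordinate. The variance is held fixed here, which loosely corresponds to the inconsistency the abstract describes in earlier work: only the first moment follows the t-distribution model, while the other statistics do not.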