We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least-squares update in which latent variables determine observation and parameter weights; these weights play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model rather than set by the user. The resulting Scale Mixture EM (SM-EM) algorithm dispenses with user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.
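To make the "EM iteration as weighted least squares" idea concrete, the sketch below implements the best-known instance for logistic regression: the Pólya-Gamma scale-mixture EM. This is a standard illustration under assumptions we have chosen (ridge penalty `lam`, function name `pg_em_logistic`), not the paper's exact SM-EM algorithm; in particular it shows how latent-variable expectations supply the observation weights (the role the abstract compares to Adam's second-moment scaling) and how a penalty term enters the solve like a weight decay.

```python
import numpy as np

def pg_em_logistic(X, y, lam=1e-3, iters=50):
    """EM for penalized logistic regression via the Polya-Gamma
    variance-mean scale-mixture representation (illustrative sketch).

    E-step: observation weights w_i = E[omega_i | beta] with
            E[omega_i] = tanh(psi_i/2) / (2 psi_i), psi_i = x_i' beta.
    M-step: a weighted least-squares / ridge solve -- no learning rate
            or momentum schedule is needed; monotone descent is the
            standard EM guarantee for this non-accelerated iteration.
    """
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5                      # working response in the PG augmentation
    for _ in range(iters):
        psi = X @ beta                   # linear predictor at current iterate
        # E[omega_i]; take the limit 1/4 as psi -> 0 to avoid 0/0
        w = np.where(np.abs(psi) < 1e-8, 0.25, np.tanh(psi / 2) / (2 * psi))
        # weighted Gram matrix plus penalty (the "weight decay"-like term)
        H = X.T @ (w[:, None] * X) + lam * np.eye(p)
        beta = np.linalg.solve(H, X.T @ kappa)   # M-step: closed-form WLS update
    return beta
```

Each iteration costs one weighted Gram-matrix build and one linear solve; the penalized objective is nonincreasing across iterations, which is the monotonicity property the abstract attributes to the base (non-accelerated) algorithm.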