The multilayer perceptron (MLP) is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go-to optimizer for training tabular DL models. Unlike architecture design, however, the choice of optimizer for tabular DL has not been examined systematically, even though new optimizers have shown promise in other domains. To fill this gap, we benchmark \Noptimizers optimizers on \Ndatasets tabular datasets, training MLP-based models in the standard supervised learning setting under a shared experimental protocol. Our main finding is that the Muon optimizer consistently outperforms AdamW and should therefore be considered a strong and practical choice for practitioners and researchers, provided that its training efficiency overhead is affordable. Additionally, we find that an exponential moving average of model weights is a simple yet effective technique that improves AdamW on vanilla MLPs, though its effect is less consistent across model variants.
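To make the weight-averaging technique concrete, the following is a minimal, framework-free sketch of an exponential moving average (EMA) of model weights. It is an illustration under stated assumptions, not the protocol used in this work: the helper `ema_update`, the toy parameter dictionary, and the decay value of 0.5 are all hypothetical and chosen only to keep the arithmetic easy to follow. In practice, the live weights would be updated by the optimizer (e.g., AdamW), and the averaged copy would be used for evaluation.

```python
from typing import Dict, List

# Hypothetical parameter container: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(ema_params: Params, params: Params, decay: float) -> None:
    """Update the averaged weights in place.

    Applies ema <- decay * ema + (1 - decay) * current to every weight.
    """
    for name, values in params.items():
        ema_values = ema_params[name]
        for i, value in enumerate(values):
            ema_values[i] = decay * ema_values[i] + (1.0 - decay) * value


# Toy "training loop": the assignment below stands in for an optimizer step.
params: Params = {"w": [0.0, 0.0]}
ema_params: Params = {"w": list(params["w"])}  # averaged copy starts at the initial weights

for step in range(1, 4):
    params["w"] = [float(step), -float(step)]  # placeholder for the real weight update
    ema_update(ema_params, params, decay=0.5)  # decay is an assumed, illustrative value

print(ema_params["w"])  # averaged weights lag behind the live weights
```

The averaged copy is kept separate from the live weights, so the optimizer's own trajectory is unaffected; only the model used for evaluation changes.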