Ensemble models often improve generalization performance on challenging tasks. Yet traditional techniques based on prediction averaging incur three well-known disadvantages: the computational overhead of training multiple models, and increased latency and memory requirements at test time. To address these issues, the Stochastic Weight Averaging (SWA) technique maintains a running average of model parameters from a specific epoch onward. Despite its potential benefits, maintaining a running average of parameters can hinder generalization, as the underlying running model begins to overfit. Conversely, an inadequately chosen starting point can render SWA more susceptible to underfitting than the underlying running model. In this work, we propose the Adaptive Stochastic Weight Averaging (ASWA) technique, which updates a running average of model parameters only when generalization performance improves on the validation dataset. Hence, ASWA can be seen as a combination of SWA with the early stopping technique, where the former accepts all updates on a parameter ensemble model and the latter rejects any update on an underlying running model. We conduct extensive experiments ranging from image classification to multi-hop reasoning over knowledge graphs. Our experiments over 11 benchmark datasets with 7 baseline models suggest that ASWA leads to statistically better generalization across models and datasets.
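The accept/reject rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: parameters are modeled as plain dicts of floats, and `evaluate` stands in for whatever validation-performance metric is used (higher is assumed better).

```python
def aswa_step(avg_params, model_params, n_avg, best_val, evaluate):
    """One ASWA step (illustrative sketch): fold the current model
    parameters into the running average only if the resulting
    parameter ensemble improves validation performance."""
    # Tentative running average including the current parameters.
    candidate = {k: (avg_params[k] * n_avg + model_params[k]) / (n_avg + 1)
                 for k in avg_params}
    score = evaluate(candidate)  # validation score of the candidate ensemble
    if score > best_val:
        # SWA-like behavior: accept the update on the parameter ensemble.
        return candidate, n_avg + 1, score
    # Early-stopping-like behavior: reject the update, keep the previous ensemble.
    return avg_params, n_avg, best_val
```

Calling `aswa_step` once per epoch (or per evaluation interval) interpolates between plain SWA, which would accept every update, and early stopping, which would freeze the parameters once validation performance stops improving.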