Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this affects not only the quality of the ensemble's predictions but, even more so, its uncertainty estimates, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks.
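To make the idea of a kernelized repulsive term concrete, the following is a minimal weight-space sketch and not the authors' implementation. The abstract does not state the exact form of the term, so the version here is an assumption: each member follows the gradient of the log posterior plus a repulsion given by the negative kernel gradient, summed over members and normalized by the kernel sum, with an RBF kernel. The function names, the bandwidth of 0.5, the step size, and the toy standard-normal "posterior" are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel_and_grad(W, bandwidth):
    """Return K[i, j] = k(w_i, w_j) and grad_K[i, j] = d k(w_i, w_j) / d w_i."""
    diff = W[:, None, :] - W[None, :, :]            # diff[i, j] = w_i - w_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * bandwidth ** 2))   # RBF kernel (assumed choice)
    grad_K = -diff / bandwidth ** 2 * K[:, :, None]
    return K, grad_K

def ensemble_step(W, grad_log_post, step_size, bandwidth=None):
    """One update of all members; bandwidth=None gives the plain ensemble."""
    drive = grad_log_post(W)                        # attraction to high posterior
    if bandwidth is None:
        return W + step_size * drive
    K, grad_K = rbf_kernel_and_grad(W, bandwidth)
    # Assumed repulsion: negative kernel gradient, normalized by the kernel sum.
    repulsion = -grad_K.sum(axis=1) / K.sum(axis=1, keepdims=True)
    return W + step_size * (drive + repulsion)

def train(W0, grad_log_post, steps=2000, step_size=0.05, bandwidth=None):
    W = W0.copy()
    for _ in range(steps):
        W = ensemble_step(W, grad_log_post, step_size, bandwidth)
    return W

def spread(W):
    """Average per-dimension standard deviation across ensemble members."""
    return float(W.std(axis=0).mean())

def grad_log_post(W):
    # Toy stand-in for a network posterior: standard normal, so grad log p(w) = -w.
    return -W

rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.01, size=(20, 2))           # 20 members, 2 "weights" each

plain = train(W0, grad_log_post)                    # no repulsion
repulsive = train(W0, grad_log_post, bandwidth=0.5)

print("plain spread:    ", spread(plain))
print("repulsive spread:", spread(repulsive))
```

In this toy setting the plain members all contract toward the single mode, while the repulsive members keep a nonzero spread around it. A real deep ensemble would replace `grad_log_post` with minibatch gradients of each network's log posterior, and a function-space variant would compute the kernel on network outputs instead of on weights.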