Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation rule remains an open question, with two commonly proposed approaches: linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty,+\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows that the optimal order varies across settings. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to the individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we exhibit explicit counterexamples showing that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
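To make the aggregation family concrete, below is a minimal NumPy sketch of the normalized generalized mean of order $r$ applied to ensemble member distributions, recovering linear pooling at $r=1$, geometric pooling in the limit $r \to 0$, and element-wise max/min at $r = \pm\infty$. The function name `generalized_mean_pool` and the numerical-stability epsilon are our own illustration, not the paper's reference implementation.

```python
import numpy as np

def generalized_mean_pool(probs: np.ndarray, r: float) -> np.ndarray:
    """Aggregate M categorical distributions (rows of `probs`, shape (M, K))
    via the normalized generalized mean of order r:
        p_agg(y) ∝ ( (1/M) * sum_i p_i(y)^r )^(1/r)
    r = 1    -> linear pooling (probability averaging)
    r = 0    -> geometric pooling (log-probability averaging), taken as a limit
    r = +/-inf -> element-wise max / min, renormalized
    """
    if r == 0.0:
        # Limit r -> 0: geometric mean, computed in log space for stability.
        agg = np.exp(np.mean(np.log(probs + 1e-12), axis=0))
    elif np.isposinf(r):
        agg = probs.max(axis=0)
    elif np.isneginf(r):
        agg = probs.min(axis=0)
    else:
        agg = np.mean(probs ** r, axis=0) ** (1.0 / r)
    return agg / agg.sum()  # renormalize to a probability distribution

# Example: pooling two ensemble members over 3 classes.
p = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.3, 0.2]])
print(generalized_mean_pool(p, r=1.0))  # linear pooling
print(generalized_mean_pool(p, r=0.0))  # geometric pooling
```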