Macro-AUC is the arithmetic mean of the class-wise AUCs in multi-label learning and is commonly used in practice. However, its theoretical understanding remains limited. To address this gap, we characterize the generalization properties of various learning algorithms based on the corresponding surrogate losses w.r.t. Macro-AUC. We theoretically identify a critical factor of the dataset affecting the generalization bounds: \emph{the label-wise class imbalance}. Our imbalance-aware error bounds show that the widely used univariate loss-based algorithm is more sensitive to the label-wise class imbalance than the proposed pairwise and reweighted loss-based ones, which may explain its worse performance. Moreover, empirical results on various datasets corroborate our theoretical findings. On the technical side, to establish these results we propose a new (and more general) McDiarmid-type concentration inequality, which may be of independent interest.
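As an illustration of the definition in the first sentence, the following minimal sketch computes Macro-AUC by averaging per-label AUCs, where each per-label AUC is the fraction of (positive, negative) instance pairs that the scores rank correctly. The function names `label_auc` and `macro_auc`, the toy data, and the convention of counting ties as one half are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def label_auc(scores, labels):
    """AUC for a single label: share of (positive, negative) pairs ranked correctly.

    Ties are counted as 1/2 (an assumed convention for this sketch).
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC is undefined without both positives and negatives")
    # Pairwise score differences between every positive and every negative.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def macro_auc(score_matrix, label_matrix):
    """Arithmetic mean of the per-label AUCs (columns are labels)."""
    num_labels = label_matrix.shape[1]
    aucs = [label_auc(score_matrix[:, k], label_matrix[:, k])
            for k in range(num_labels)]
    return float(np.mean(aucs))

# Toy data: 4 instances, 2 labels. Label 1 has a single positive instance,
# so it is more imbalanced than label 0.
Y = np.array([[1, 0],
              [0, 0],
              [1, 1],
              [0, 0]])
S = np.array([[0.9, 0.2],
              [0.1, 0.8],
              [0.8, 0.7],
              [0.3, 0.1]])

result = macro_auc(S, Y)
print(result)
```

In this toy example label 0 is ranked perfectly (AUC 1), while for label 1 one of the three negatives outscores the single positive (AUC 2/3), so the Macro-AUC is their mean, 5/6.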