Macro-AUC is the arithmetic mean of the class-wise AUCs in multi-label learning and is commonly used in practice. However, its theoretical understanding remains limited. To address this gap, we characterize the generalization properties of various learning algorithms based on the corresponding surrogate losses w.r.t. Macro-AUC. We theoretically identify a critical dataset-dependent factor affecting the generalization bounds: \emph{the label-wise class imbalance}. Our imbalance-aware error bounds show that the widely used univariate loss-based algorithm is more sensitive to the label-wise class imbalance than the proposed pairwise and reweighted loss-based ones, which likely implies worse performance for the former. Moreover, empirical results on various datasets corroborate our theoretical findings. Technically, to establish these results, we propose a new (and more general) McDiarmid-type concentration inequality, which may be of independent interest.
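
To make the definition in the first sentence concrete, the following is a minimal sketch of Macro-AUC as the unweighted mean of per-label AUCs, with each per-label AUC computed as the fraction of correctly ranked (positive, negative) pairs. The function names `label_auc` and `macro_auc`, the toy data, the half-credit for ties, and the choice to skip labels that lack positives or negatives are illustrative assumptions, not taken from the paper.

```python
def label_auc(labels, scores):
    """AUC for one label: share of (positive, negative) pairs ranked correctly.

    Assumption: tied scores count as half a correct pair.
    Returns None when the label has no positives or no negatives.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    # The number of pairs is |pos| * |neg|, so it depends on this
    # label's class balance.
    return wins / (len(pos) * len(neg))


def macro_auc(Y, S):
    """Arithmetic mean of the per-label AUCs.

    Y: list of rows of 0/1 labels, S: list of rows of real-valued scores,
    both of shape (num_samples, num_labels).
    Assumption: labels with an undefined AUC are skipped.
    """
    num_labels = len(Y[0])
    aucs = []
    for k in range(num_labels):
        auc = label_auc([row[k] for row in Y], [row[k] for row in S])
        if auc is not None:
            aucs.append(auc)
    if not aucs:
        raise ValueError("no label has both positive and negative samples")
    return sum(aucs) / len(aucs)


# Toy data (hypothetical): 4 samples, 2 labels.
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]
S = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.5]]
print(macro_auc(Y, S))
```

On this toy data each label has two positives and two negatives, and three of the four pairs are ranked correctly for each label, so both per-label AUCs are 0.75 and so is their mean.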