Multi-Instance Partial Label Learning (MI-PLL) is a weakly-supervised learning setting encompassing partial label learning, latent structural learning, and neurosymbolic learning. Unlike in supervised learning, in MI-PLL the inputs to the classifiers at training time are tuples of instances $\textbf{x}$, while the supervision signal is generated by a function $\sigma$ over the gold labels of $\textbf{x}$; the gold labels themselves are hidden during training. In this paper, we focus on characterizing and mitigating learning imbalances under MI-PLL, i.e., differences in the errors occurring when classifying instances of different classes (also known as class-specific risks). Learning imbalances have been studied extensively in the context of long-tail learning; however, the nature of MI-PLL introduces new challenges. Our contributions are as follows. On the theoretical side, we characterize learning imbalances by deriving class-specific risk bounds that depend on the function $\sigma$. Our theory reveals that learning imbalances exist in MI-PLL even when the hidden labels are uniformly distributed. On the practical side, we introduce a technique for estimating the marginal distribution of the hidden labels using only MI-PLL data. We then introduce algorithms that mitigate imbalances at training and testing time by treating this marginal as a constraint. The first algorithm relies on a novel linear programming formulation of MI-PLL for pseudo-labeling; the second adjusts a model's scores based on robust optimal transport. We demonstrate the effectiveness of our techniques against strong neurosymbolic and long-tail learning baselines, and discuss open challenges.
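The MI-PLL supervision setup described above can be illustrated with a minimal sketch. Here we assume $\sigma$ is the sum of the hidden labels, as in the common MNIST-addition neurosymbolic benchmark; the function names `partial_label` and `consistent_labelings` are illustrative, not from the paper.

```python
from itertools import product

def partial_label(y, sigma=sum):
    """Weak supervision signal for a tuple of gold labels y.

    The gold labels y are hidden at training time; only sigma(y) is observed.
    Here sigma defaults to summation (an assumed example of sigma).
    """
    return sigma(y)

def consistent_labelings(s, arity, num_classes, sigma=sum):
    """Enumerate all hidden-label tuples compatible with the signal s."""
    return [y for y in product(range(num_classes), repeat=arity)
            if sigma(y) == s]

# One training example: the gold labels (3, 5) stay hidden; only their
# sum, 8, supervises the classifier on the instance tuple x.
s = partial_label((3, 5))
candidates = consistent_labelings(s, arity=2, num_classes=10)
# (3, 5) is among the candidates, but so are (0, 8), (4, 4), etc.,
# which is the source of the partial-label ambiguity.
```

Note that the set of consistent labelings is not uniform across classes (e.g., under $\sigma = {+}$, a signal of 0 pins down the labels exactly while a signal of 9 admits ten candidate pairs), which hints at why class-specific risks can differ even under a uniform hidden-label marginal.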