In many applications, especially due to lack of supervision or privacy concerns, the training data is grouped into bags of instances (feature-vectors), and for each bag we have only an aggregate label derived from the instance-labels in the bag. In learning from label proportions (LLP) the aggregate label is the average of the instance-labels in a bag, and a significant body of work has focused on training models in the LLP setting to predict instance-labels. In practice, however, the training data may include fully supervised albeit covariate-shifted source data, along with the usual target data carrying only bag-labels, and we wish to train a good instance-level predictor on the target domain. We call this the covariate-shifted hybrid LLP problem. Fully supervised covariate-shifted data often carries useful training signals, and the goal is to leverage them for better predictive performance in the hybrid LLP setting. To achieve this, we develop methods for hybrid LLP which naturally incorporate the target bag-labels along with the source instance-labels in the domain adaptation framework. Apart from proving theoretical guarantees bounding the target generalization error, we also conduct experiments on several publicly available datasets showing that our methods outperform LLP and domain adaptation baselines as well as techniques from previous related work.
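To make the LLP setup concrete, the following is a minimal, hypothetical sketch (not the paper's method) of how bag-labels arise: instances are partitioned into bags, and the only supervision retained per bag is the average of its instance-labels, i.e. the label proportion. The function name `make_bags` and the toy linear labeling rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bags(X, y, bag_size):
    """Partition instances into equal-size bags and return, for each bag,
    its feature matrix and its label proportion (the mean instance-label)."""
    n_bags = len(X) // bag_size
    bags = [X[i * bag_size:(i + 1) * bag_size] for i in range(n_bags)]
    # The aggregate label is the average of the instance-labels in the bag;
    # individual instance-labels are discarded after this step.
    proportions = np.array(
        [y[i * bag_size:(i + 1) * bag_size].mean() for i in range(n_bags)]
    )
    return bags, proportions

# Toy data: binary instance-labels from a simple linear rule (illustrative only).
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -1.0]) > 0).astype(float)
bags, props = make_bags(X, y, bag_size=10)
```

Since the bags partition the data into equal-size groups, the mean of the bag proportions equals the overall positive rate, which is what makes the aggregate labels a usable (if weak) training signal.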