Boosting is a commonly used technique for improving the performance of a set of base models by combining them into a strong ensemble model. Though widely adopted, boosting is typically applied in supervised learning, where the data is labeled accurately. In weakly supervised learning, however, most of the data is labeled through weak and noisy sources, and designing effective boosting approaches remains nontrivial. In this work, we show that the standard implementation of the convex combination of base learners performs poorly in the presence of noisy labels. Instead, we propose $\textit{LocalBoost}$, a novel framework for weakly supervised boosting. LocalBoost iteratively boosts the ensemble model along two dimensions: intra-source and inter-source. Intra-source boosting introduces locality to the base learners: new base learners are trained on error regions of varying granularity, so that each base learner focuses on a particular feature regime. For inter-source boosting, we leverage a conditional function that indicates the weak source in which a sample is more likely to appear. To account for the weak labels, we further design an estimate-then-modify approach to compute the model weights. Experiments on seven datasets show that our method significantly outperforms vanilla boosting methods and other weakly supervised methods.
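For readers unfamiliar with the baseline that the abstract contrasts against, the following is a minimal sketch of a vanilla convex combination of base learners, written as AdaBoost-style boosting over 1-D decision stumps. It is not LocalBoost and does not reproduce the paper's method; the function names (`train_stump`, `boost`, `ensemble_predict`), the choice of stumps as base learners, and the toy data are all assumptions made for illustration.

```python
import math

def stump_predict(thr, pol, x):
    """A 1-D decision stump: predicts pol for x >= thr, otherwise -pol."""
    return pol if x >= thr else -pol

def train_stump(xs, ys, w):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(thr, pol, xi) != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def boost(xs, ys, rounds=5):
    """AdaBoost-style loop; returns [(weight, thr, pol)] with weights summing to 1."""
    n = len(xs)
    w = [1.0 / n] * n                      # uniform sample weights to start
    learners = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        if alpha <= 0:                     # no better than chance: stop
            break
        learners.append((alpha, thr, pol))
        # Up-weight misclassified samples, down-weight correct ones.
        w = [wi * math.exp(-alpha * yi * stump_predict(thr, pol, xi))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    if not learners:
        return []
    total = sum(a for a, _, _ in learners)
    # Normalize so the ensemble is a convex combination of base learners.
    return [(a / total, thr, pol) for a, thr, pol in learners]

def ensemble_predict(learners, x):
    """Sign of the weighted vote of all base learners."""
    score = sum(a * stump_predict(thr, pol, x) for a, thr, pol in learners)
    return 1 if score >= 0 else -1

# Toy data with clean labels (hypothetical, for illustration only).
xs = [0, 1, 2, 3, 4, 5]
ys = [-1, -1, -1, 1, 1, 1]
model = boost(xs, ys)
preds = [ensemble_predict(model, x) for x in xs]
```

Note that both the learner weights and the sample reweighting in this sketch are computed directly from the observed labels, so mislabeled samples would be up-weighted as if they were genuine errors. This is one plausible reading of why the abstract argues that the standard convex combination struggles under weak supervision.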