Training machine learning models on data affected by both weak supervision and dataset shift remains challenging. Algorithm design for the case where the two occur together has received little attention, and existing algorithms cannot always handle the most complex distribution shifts. We argue that the biquality data setup is a suitable framework for designing such algorithms. Biquality learning assumes that two datasets are available at training time: a trusted dataset sampled from the distribution of interest, and an untrusted dataset affected by dataset shifts and weaknesses of supervision (referred to collectively as distribution shifts). Having both datasets at training time makes it possible to design algorithms that deal with any distribution shift. We propose two biquality learning methods, one inspired by the label noise literature and the other by the covariate shift literature. We experiment with two novel methods for synthetically introducing concept drift and class-conditional shifts into many real-world datasets. We open some discussions and conclude that developing biquality learning algorithms robust to distributional changes remains an interesting problem for future research.
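To make the biquality data setup concrete, the following is a minimal sketch of how a clean labelled dataset could be split into a small trusted part and a larger untrusted part whose labels are corrupted. It is an illustration under stated assumptions, not the corruption procedure proposed in this work: the function name `make_biquality_split`, the `trusted_fraction` parameter, and the use of a class-conditional transition matrix (a standard device in the label noise literature) are all hypothetical choices made for this example.

```python
import random


def make_biquality_split(features, labels, trusted_fraction, transition, seed=0):
    """Split a clean dataset into a trusted part and a label-corrupted untrusted part.

    Assumption for this sketch: labels are integers 0..K-1 and
    transition[c][k] is the probability that true class c is recorded
    as class k in the untrusted part (each row sums to 1).
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    if not 0.0 < trusted_fraction < 1.0:
        raise ValueError("trusted_fraction must be in (0, 1)")

    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    # Keep at least one trusted sample; the rest becomes the untrusted pool.
    n_trusted = max(1, round(trusted_fraction * len(indices)))
    trusted_idx, untrusted_idx = indices[:n_trusted], indices[n_trusted:]

    # Trusted samples keep their clean labels.
    trusted = [(features[i], labels[i]) for i in trusted_idx]

    # Untrusted labels are resampled from the row of the true class.
    classes = list(range(len(transition)))
    untrusted = []
    for i in untrusted_idx:
        noisy_label = rng.choices(classes, weights=transition[labels[i]])[0]
        untrusted.append((features[i], noisy_label))
    return trusted, untrusted


if __name__ == "__main__":
    # Toy binary dataset: class 1 is corrupted more often than class 0.
    features = [[float(i)] for i in range(1000)]
    labels = [i % 2 for i in range(1000)]
    transition = [[0.9, 0.1], [0.4, 0.6]]
    trusted, untrusted = make_biquality_split(features, labels, 0.1, transition)
    print(len(trusted), len(untrusted))
```

Because the rows of the transition matrix differ per class, the corruption in the untrusted part depends on the true label, which is one simple instance of a class-conditional shift. A biquality learning algorithm would then receive both parts at training time and use the trusted one to correct for the corruption in the untrusted one.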