We introduce an adaptive method with formal quality guarantees for weak supervision in a non-stationary setting. Our goal is to infer the unknown labels of a sequence of data points using weak supervision sources that provide independent, noisy signals of the correct classification of each data point. This setting includes crowdsourcing and programmatic weak supervision. We focus on the non-stationary case, where the accuracy of the weak supervision sources can drift over time, e.g., because of changes in the underlying data distribution. Because of this drift, older data can provide misleading information when inferring the label of the current data point. Previous work relied on a priori assumptions about the magnitude of the drift to decide how much past data to use. In contrast, our algorithm requires no assumptions on the drift and adapts based on the input. In particular, at each step, our algorithm guarantees an estimate of the current accuracies of the weak supervision sources over a window of past observations that minimizes a trade-off between the error due to the variance of the estimate and the error due to the drift. Experiments on synthetic and real-world labelers show that our approach indeed adapts to the drift. Unlike fixed-window-size strategies, it dynamically chooses a window size that allows it to consistently maintain good performance.
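To make the variance-versus-drift trade-off concrete, the following is a minimal sketch of one generic way to pick a window size adaptively. It is not the algorithm of this paper: it assumes a single stream of 0/1 statistics (for instance, whether two sources agree at each step), a doubling schedule for candidate windows, and a Hoeffding-style deviation radius. The names `window_mean`, `choose_window`, and the parameter `delta` are hypothetical and introduced only for illustration.

```python
import math


def window_mean(values, r):
    """Mean of the most recent r entries of values (oldest first)."""
    tail = values[-r:]
    return sum(tail) / len(tail)


def choose_window(values, delta=0.05):
    """Illustrative window selection by doubling (an assumption, not the paper's method).

    values: list of observations in [0, 1], oldest first.
    delta:  hypothetical confidence parameter for the deviation radius.

    The window grows while the mean over the doubled window stays within
    the noise level of the current window. A larger gap is treated as
    evidence of drift, and the search stops at the current size.
    """
    n = len(values)
    r = 1
    while 2 * r <= n:
        # Hoeffding-style radius for the mean of r values in [0, 1].
        radius = math.sqrt(math.log(2.0 / delta) / (2.0 * r))
        gap = abs(window_mean(values, 2 * r) - window_mean(values, r))
        if gap > 2.0 * radius:
            break  # older data looks inconsistent with recent data
        r *= 2
    return r


# Stationary stream: the window grows to cover all the data.
stationary = [i % 2 for i in range(256)]
# Abrupt drift: the statistic flips from 0 to 1 for the last 64 steps.
drifting = [0] * 192 + [1] * 64

stationary_window = choose_window(stationary)
drifting_window = choose_window(drifting)
```

In this sketch a small window has high variance (large radius), while a large window risks mixing in stale data. Growth stops once the discrepancy between nested windows exceeds what sampling noise alone would explain, so the drifting stream ends up with a shorter window than the stationary one.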