We study the problem of actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting in which the inputs belong to one of $K$ discrete distributions, and we formalize this problem via a loss that captures both the labeling cost and the prediction error. When the labeling cost is $B$, our algorithm, which labels a point if its uncertainty is larger than a time- and cost-dependent threshold, achieves a worst-case upper bound of $\widetilde{O}(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after $T$ rounds. We also provide a more nuanced upper bound, which demonstrates that the algorithm adapts to the arrival pattern and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm. After $T$ rounds in $d$ dimensions, we show that the loss is bounded by $\widetilde{O}(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS with a squared exponential kernel and by $\widetilde{O}(B^{\frac{1}{2d+3}} T^{\frac{2d+2}{2d+3}})$ in an RKHS with a Mat\'ern kernel. Our empirical evaluation demonstrates that our method outperforms baselines in several synthetic experiments and in two real-world experiments in medicine and astronomy.
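To make the threshold rule in the discrete setting concrete, the following is a minimal illustrative sketch, not the algorithm as specified in the paper. Several pieces are assumptions made here for illustration: the function name `run_threshold_labeler`, Gaussian label noise, the uncertainty proxy $\sigma/\sqrt{n_k}$ (the standard error of the running mean for type $k$), the per-round loss (pay $B$ when labeling, otherwise pay the absolute prediction error), and the threshold $(BK/t)^{1/3}$. This threshold comes from a heuristic balance: labeling each type $n$ times costs about $BKn$ and leaves a total prediction error of about $T/\sqrt{n}$, and equating the two gives $n \approx (T/(BK))^{2/3}$, which is consistent with the $B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}}$ scaling stated above.

```python
import math
import random


def run_threshold_labeler(arrivals, means, cost_b, noise_sd=1.0, seed=0):
    """Simulate a threshold labeling rule on a stream of K discrete types.

    arrivals: sequence of type indices in range(K), one per round.
    means:    true mean label of each type (used only to simulate the
              experiment and to score prediction errors).
    cost_b:   cost B of running one labeling experiment.
    Returns (total_loss, num_labels).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of labels collected per type
    sums = [0.0] * k      # sum of observed labels per type
    total_loss = 0.0
    num_labels = 0

    for t, idx in enumerate(arrivals, start=1):
        # Assumed uncertainty proxy: standard error of the running mean,
        # infinite when the type has never been labeled.
        if counts[idx] == 0:
            uncertainty = math.inf
        else:
            uncertainty = noise_sd / math.sqrt(counts[idx])

        # Hypothetical time- and cost-dependent threshold: it grows with
        # the cost B and shrinks as the round index t grows.
        threshold = (cost_b * k / t) ** (1.0 / 3.0)

        if uncertainty > threshold:
            # Run the expensive experiment and pay the labeling cost.
            y = rng.gauss(means[idx], noise_sd)
            counts[idx] += 1
            sums[idx] += y
            total_loss += cost_b
            num_labels += 1
        else:
            # Predict with the running mean and pay the prediction error.
            prediction = sums[idx] / counts[idx]
            total_loss += abs(prediction - means[idx])

    return total_loss, num_labels


if __name__ == "__main__":
    stream = [i % 3 for i in range(3000)]  # round-robin arrival pattern
    loss, labels = run_threshold_labeler(stream, [0.2, 0.5, 0.8], cost_b=2.0)
    print(f"total loss: {loss:.2f}, labels requested: {labels}")
```

In this sketch a larger `cost_b` raises the threshold and so reduces the number of experiments, while a longer stream lowers it and allows more labels per type. Other arrival patterns can be explored by changing `stream`.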