How can we collect the most useful labels to learn a model selection policy when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive number of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In contrast to prior work, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis of the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.
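To make the problem setting concrete, the following is a minimal sketch of an online contextual active model selection loop. It is not the CAMS algorithm itself: the abstract does not specify the query criterion or the policy class, so this sketch makes its own assumptions. Contexts are assumed to be discrete bucket ids, the "policy" is assumed to be one exponential-weights distribution over models per bucket, and the query probability is assumed to be the weighted disagreement among the models' predictions (with a floor `min_query_prob`). The function name `run_active_selection` and all parameters are hypothetical.

```python
import numpy as np

def run_active_selection(contexts, predictions, labels, n_contexts,
                         eta=1.0, min_query_prob=0.05, seed=0):
    """Illustrative online loop (assumed design, not the paper's algorithm).

    contexts:    (T,) integer context-bucket ids in [0, n_contexts)
    predictions: (T, K) class predicted by each of K pre-trained models
    labels:      (T,) true classes, revealed only when a query is made
    Returns the chosen model per round, the number of label queries,
    and the final log-weights per context bucket.
    """
    rng = np.random.default_rng(seed)
    T, K = predictions.shape
    log_w = np.zeros((n_contexts, K))  # one weight vector per context bucket
    chosen = np.empty(T, dtype=int)
    n_queries = 0

    for t in range(T):
        c = contexts[t]
        # Current distribution over models for this context.
        p = np.exp(log_w[c] - log_w[c].max())
        p /= p.sum()
        chosen[t] = rng.choice(K, p=p)

        # If every model predicts the same class, the label cannot
        # change the relative ranking of models, so skip the query.
        if np.unique(predictions[t]).size == 1:
            continue

        # Weighted disagreement: probability mass outside the
        # plurality prediction (an assumed uncertainty measure).
        votes = np.bincount(predictions[t], weights=p)
        q = max(min_query_prob, 1.0 - votes.max())

        if rng.random() < q:
            n_queries += 1
            loss = (predictions[t] != labels[t]).astype(float)
            # Importance-weighted update keeps the loss estimate unbiased.
            log_w[c] -= eta * loss / q

    return chosen, n_queries, log_w
```

In this sketch, labels are requested mostly in early rounds, when the models with substantial weight still disagree. Once one model dominates in a context, the query rate falls to the floor `min_query_prob`. Because weights are kept per context, no single model needs to be best everywhere.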