Given a (machine learning) classifier and a collection of unlabeled data, how can we efficiently identify the misclassification patterns present in this dataset? To address this problem, we propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The recommendation algorithm is conceptualized as a stochastic sampler that, in each round, queries the annotators for the true labels of a subset of samples and receives feedback on whether those samples are misclassified. The sampling mechanism must balance discovering new patterns of misclassification (exploration) against confirming suspected patterns of misclassification (exploitation). We construct a determinantal point process whose intensity balances this exploration-exploitation trade-off through a weighted update of the posterior at each round; this process serves as the generator of the stochastic sampler. Numerical experiments on multiple datasets at various signal-to-noise ratios show that our framework performs competitively.
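The sampling loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the RBF similarity, the quality term `exp(exploit_weight * p_fail)`, the kernel-weighted failure estimate standing in for the posterior update, and every function and parameter name below are hypothetical choices made for the example. The only standard ingredient is the spectral algorithm for drawing an exact sample from an L-ensemble determinantal point process.

```python
import numpy as np


def rbf_similarity(X, length_scale=1.0):
    """Pairwise RBF similarity; encodes diversity (an assumed choice)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-0.5 * d2 / length_scale**2)


def build_kernel(S, p_fail, exploit_weight=2.0):
    """L = diag(q) S diag(q): high q favors suspected failures (exploitation),
    while S discourages picking near-duplicates (exploration)."""
    q = np.exp(exploit_weight * p_fail)
    return q[:, None] * S * q[None, :]


def sample_dpp(L, rng):
    """Exact sample from the L-ensemble DPP with PSD kernel L."""
    n = L.shape[0]
    eigvals, eigvecs = np.linalg.eigh(L)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against round-off
    # Phase 1: keep each eigenvector with probability lambda / (1 + lambda).
    keep = rng.random(n) < eigvals / (1.0 + eigvals)
    V = eigvecs[:, keep]
    chosen = []
    # Phase 2: pick one item per remaining basis vector.
    while V.shape[1] > 0:
        probs = np.sum(V**2, axis=1)
        probs = probs / probs.sum()
        i = int(rng.choice(n, p=probs))
        chosen.append(i)
        # Project the basis onto the subspace orthogonal to e_i.
        j = int(np.argmax(np.abs(V[i, :])))
        vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        V = V - np.outer(vj, V[i, :] / vj[i])
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)  # re-orthonormalize
    return sorted(chosen)


def update_fail_estimate(S, labeled_idx, failed, prior=0.5, prior_strength=1.0):
    """Kernel-weighted failure estimate for every point.
    A simple stand-in for the paper's posterior update, not the paper's rule."""
    W = S[:, labeled_idx]
    return (prior * prior_strength + W @ failed) / (prior_strength + W.sum(axis=1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))
    # Hypothetical ground truth held by the annotators.
    misclassified = (X[:, 0] > 1.0).astype(float)
    S = rbf_similarity(X)
    p_fail = np.full(len(X), 0.5)
    labeled = []
    for round_id in range(3):
        draw = sample_dpp(build_kernel(S, p_fail), rng)
        # Cap the batch at an annotator budget of 5 (a k-DPP would be more principled).
        batch = [i for i in draw if i not in labeled][:5]
        labeled.extend(batch)
        idx = np.array(labeled, dtype=int)
        p_fail = update_fail_estimate(S, idx, misclassified[idx])
        print(round_id, batch)
```

In this sketch a larger `exploit_weight` concentrates queries near points already believed to be misclassified, while a smaller value lets the similarity term dominate and spreads queries across the dataset.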