Streaming algorithms for evaluating noisy judges on unlabeled data -- binary classification

The evaluation of noisy binary classifiers on unlabeled data is treated as a streaming task: given a data sketch of an ensemble's decisions, estimate the true prevalence of the labels as well as each classifier's accuracy on them. Two fully algebraic evaluators are constructed to do this. Both are based on the assumption that the classifiers make independent errors. The first is based on majority voting. The second, the main contribution of the paper, is guaranteed to be correct when that assumption holds. But how do we know the classifiers are independent on any given test? This principal/agent monitoring paradox is ameliorated by exploiting the failures of the independent evaluator to return sensible estimates. A search for nearly error-independent trios is carried out empirically on the \texttt{adult}, \texttt{mushroom}, and \texttt{two-norm} datasets, using the algebraic failure modes to reject evaluation ensembles as too correlated. The searches are refined by constructing a surface in evaluation space that contains the true value point. The algebra of arbitrarily correlated classifiers permits the selection of a polynomial subset free of any correlation variables. Candidate evaluation ensembles are rejected if their data sketches produce independent estimates too far from the constructed surface. The estimates produced by the surviving ensembles are sometimes accurate to within 1\%. But handling even small amounts of correlation remains a challenge. A Taylor expansion of the estimates obtained when independence is assumed but the classifiers are, in fact, slightly correlated helps clarify the algebraic `blind spots' of the independent evaluator.
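
To make the two evaluators concrete, the following is a minimal Python sketch for a trio of classifiers voting on labels 'a' and 'b'. It assumes the data sketch is the eight counters of vote patterns. The majority-voting evaluator takes the majority label as the truth. The independent evaluator shown here is a standard method-of-moments solution for three error-independent classifiers, used as a stand-in and not necessarily the polynomial construction of the paper. All function names are hypothetical, and each classifier is assumed to be better than random so that the label-swap ambiguity can be resolved.

```python
from collections import Counter
from itertools import product
from math import sqrt

PATTERNS = list(product("ab", repeat=3))  # the 8 possible vote patterns of a trio


def update_sketch(sketch, votes):
    """Stream one item's three votes ('a' or 'b') into the 8-counter sketch."""
    sketch[tuple(votes)] += 1


def majority_vote_estimate(sketch):
    """Treat the majority label as the truth (an approximation, not a guarantee)."""
    n = sum(sketch.values())
    maj_a = {v: c for v, c in sketch.items() if v.count("a") >= 2}
    maj_b = {v: c for v, c in sketch.items() if v.count("a") < 2}
    n_a, n_b = sum(maj_a.values()), sum(maj_b.values())
    if n_a == 0 or n_b == 0:
        return None  # one label never wins the vote, so its accuracies are undefined
    acc_a = [sum(c for v, c in maj_a.items() if v[i] == "a") / n_a for i in range(3)]
    acc_b = [sum(c for v, c in maj_b.items() if v[i] == "b") / n_b for i in range(3)]
    return n_a / n, acc_a, acc_b


def independent_moment_estimate(sketch):
    """Method-of-moments estimate assuming error independence (illustrative stand-in).

    Returns (prevalence of 'a', accuracies on 'a', accuracies on 'b'),
    or None when the algebra fails to give sensible values.
    """
    n = sum(sketch.values())
    if n == 0:
        return None
    f = {v: sketch.get(v, 0) / n for v in PATTERNS}
    # m[i]: observed frequency with which classifier i votes 'a'
    m = [sum(fv for v, fv in f.items() if v[i] == "a") for i in range(3)]

    def central(idx):
        # Central moment of the 'a'-vote indicators of the classifiers in idx
        total = 0.0
        for v, fv in f.items():
            term = fv
            for i in idx:
                term *= (v[i] == "a") - m[i]
            total += term
        return total

    c12, c13, c23 = central((0, 1)), central((0, 2)), central((1, 2))
    c123 = central((0, 1, 2))
    if min(c12, c13, c23) <= 0:
        return None  # failure mode: incompatible with independent, better-than-random judges
    r = c123 ** 2 / (c12 * c13 * c23)  # equals (1 - 2p)^2 / (p (1 - p)) under independence
    root = sqrt(r / (4 + r))
    p = (1 - root) / 2 if c123 > 0 else (1 + root) / 2
    pq = p * (1 - p)
    # d[i] = (accuracy on 'a') + (accuracy on 'b') - 1 for classifier i
    d = [
        sqrt(c12 * c13 / (c23 * pq)),
        sqrt(c12 * c23 / (c13 * pq)),
        sqrt(c13 * c23 / (c12 * pq)),
    ]
    acc_a = [m[i] + (1 - p) * d[i] for i in range(3)]
    acc_b = [1 - m[i] + p * d[i] for i in range(3)]
    if not all(0 <= x <= 1 for x in [p, *acc_a, *acc_b]):
        return None  # failure mode: estimates fall outside the unit interval
    return p, acc_a, acc_b
```

A sketch is built by creating an empty Counter and calling update_sketch once per item in the stream. In this illustration, a None result from independent_moment_estimate plays the role of the failure to return sensible estimates that the search uses to reject a trio as too correlated.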
