Towards Scalable Oversight via Partitioned Human Supervision

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. We focus on tasks that require deep knowledge and skills across multiple domains, where this bottleneck is severe. Even the best human experts are typically knowledgeable in only a single narrow area, so they cannot evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans can still provide a weak signal, namely a complementary label indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to any cardiovascular disease," even if they cannot identify the true disease. Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without preparing the ground truth. We derive an unbiased estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators that combine scarce ordinary labels with abundant complementary labels, and we provide finite-sample deviation guarantees for both the complementary-only and the mixed estimators. Empirically, we show that the output of large language models can be evaluated without the ground truth when complementary labels are available. We further show that an AI system can be trained with such weak signals: we demonstrate how to automatically design an agentic AI system that improves itself using this partitioned human supervision. Our code is available at https://github.com/R-Yin-217/Towards-Scalable-Oversight-via-Partitioned-Human-Supervision.
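To make the idea of estimating top-1 accuracy from complementary labels concrete, the following is a minimal sketch and not necessarily the estimator derived in the paper. It assumes a K-way multiple-choice task in which each complementary label is drawn uniformly at random from the K-1 incorrect options. Under that assumption a correct prediction never coincides with the complementary label, while an incorrect one coincides with probability 1/(K-1), so the accuracy A satisfies P(prediction = complementary label) = (1 - A)/(K - 1). The function names `complementary_accuracy` and `simulate` are hypothetical and chosen only for illustration.

```python
import random


def complementary_accuracy(preds, comp_labels, num_classes):
    """Estimate top-1 accuracy from complementary labels.

    Assumption (not taken from the paper): each complementary label is
    uniform over the num_classes - 1 incorrect options. Then
    A = 1 - (K - 1) * P(pred == comp), and replacing the probability
    with its empirical frequency gives an unbiased estimate.
    """
    if not preds or len(preds) != len(comp_labels):
        raise ValueError("preds and comp_labels must be non-empty and equal length")
    hits = sum(p == c for p, c in zip(preds, comp_labels))
    return 1.0 - (num_classes - 1) * hits / len(preds)


def simulate(n, num_classes, true_acc, seed=0):
    """Generate synthetic predictions and uniform complementary labels."""
    rng = random.Random(seed)
    preds, comps = [], []
    for _ in range(n):
        y = rng.randrange(num_classes)  # hidden ground truth
        wrong = [k for k in range(num_classes) if k != y]
        # The model is correct with probability true_acc.
        preds.append(y if rng.random() < true_acc else rng.choice(wrong))
        # The annotator names one incorrect option uniformly at random.
        comps.append(rng.choice(wrong))
    return preds, comps


if __name__ == "__main__":
    preds, comps = simulate(n=50_000, num_classes=4, true_acc=0.7)
    print(complementary_accuracy(preds, comps, num_classes=4))
```

On a finite sample this estimate can fall outside [0, 1], because it rescales a frequency by K-1. The same factor inflates its variance relative to an estimate from ordinary labels, which is consistent with the abstract's point that more complementary labels are needed to match the variance of ordinary ones.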
