We study the statistical properties of learning to defer (L2D) to multiple experts. In particular, we address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts. First, we derive two consistent surrogates -- one based on a softmax parameterization, the other on a one-vs-all (OvA) parameterization -- that are analogous to the single-expert losses proposed by Mozannar and Sontag (2020) and Verma and Nalisnick (2022), respectively. We then study the frameworks' ability to estimate P(m_j = y | x), the probability that the jth expert will correctly predict the label for x. Theory shows that the softmax-based loss causes mis-calibration to propagate between the estimates, while the OvA-based loss does not (though in practice, we find there are trade-offs). Lastly, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. We perform empirical validation on galaxy, skin lesion, and hate speech classification tasks.
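To make the two parameterizations concrete, the following is a minimal single-example sketch, not the authors' implementation. It assumes a model that emits K class logits followed by J deferral logits (one per expert), a logistic binary surrogate for the OvA loss, and a multi-expert form of the softmax estimate of P(m_j = y | x) that extends the single-expert expression by normalizing each deferral probability by the total class mass. All function names and the logit layout are hypothetical.

```python
import math


def _softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def _sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def _logistic_loss(z):
    """Binary surrogate phi(z) = log(1 + exp(-z)), computed stably."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def softmax_l2d_loss(logits, y, expert_preds, num_classes):
    """Softmax surrogate: reward the true class, plus every correct expert."""
    probs = _softmax(logits)
    loss = -math.log(probs[y])
    for j, m_j in enumerate(expert_preds):
        if m_j == y:  # expert j is correct on this example
            loss -= math.log(probs[num_classes + j])
    return loss


def softmax_expert_confidence(logits, num_classes):
    """Assumed estimate of P(m_j = y | x): p_defer_j / (1 - total defer mass).

    The ratio is not bounded above by 1.
    """
    probs = _softmax(logits)
    class_mass = 1.0 - sum(probs[num_classes:])
    return [p / class_mass for p in probs[num_classes:]]


def ova_l2d_loss(logits, y, expert_preds, num_classes):
    """OvA surrogate: one independent binary problem per output."""
    loss = _logistic_loss(logits[y])
    loss += sum(_logistic_loss(-logits[k]) for k in range(num_classes) if k != y)
    for j, m_j in enumerate(expert_preds):
        g = logits[num_classes + j]
        # Positive label for the deferral output only if expert j is correct.
        loss += _logistic_loss(g) if m_j == y else _logistic_loss(-g)
    return loss


def ova_expert_confidence(logits, num_classes):
    """Estimate of P(m_j = y | x): sigmoid of each deferral logit, in (0, 1)."""
    return [_sigmoid(g) for g in logits[num_classes:]]


# Illustrative values only: K = 3 classes, J = 2 experts.
example_logits = [2.0, 0.5, -1.0, 0.0, 1.0]
print(softmax_expert_confidence(example_logits, num_classes=3))
print(ova_expert_confidence(example_logits, num_classes=3))
```

In this sketch each OvA estimate depends only on its own deferral logit, whereas each softmax estimate is a ratio involving all class logits and can exceed 1 when a deferral logit is large.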