Existing multi-expert learning-to-defer surrogates are statistically consistent, yet they can underfit, suppress useful experts, or degrade as the expert pool grows. We trace these failures to a shared architectural choice: casting classes and experts as actions inside one augmented prediction geometry. Consistency governs the population target; it says nothing about how the surrogate distributes gradient mass during training. We analyze five surrogates along both axes and show that each trades a fix on one for a failure on the other. We then introduce a decoupled surrogate that estimates the class posterior with a softmax and each expert utility with an independent sigmoid. It admits an $\mathcal{H}$-consistency bound whose constant is $J$-independent for fixed per-expert weight $\beta{=}\lambda/J$, and its gradients are free of the amplification, starvation, and coupling pathologies of the augmented family. Experiments on synthetic benchmarks, CIFAR-10, CIFAR-10H, and Covertype confirm that the decoupled surrogate is the only method that avoids amplification under redundancy, preserves rare specialists, and consistently improves over a standalone classifier across all settings.
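To make the decoupled idea concrete, the following is a minimal NumPy sketch of one plausible reading of it, not the paper's exact objective. It assumes the class head is trained with softmax cross-entropy and each expert head with an independent sigmoid (binary cross-entropy) on whether that expert was correct, with the expert terms weighted by $\beta = \lambda/J$. The function names, the `expert_correct` target, and the choice of cross-entropy losses are illustrative assumptions.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def decoupled_loss(class_logits, expert_logits, y, expert_correct, lam=1.0):
    """Hypothetical decoupled surrogate (illustrative assumption).

    class_logits:   (n, K) scores for the K classes
    expert_logits:  (n, J) one score per expert
    y:              (n,)   integer class labels
    expert_correct: (n, J) 1.0 where expert j was correct, else 0.0
    lam:            total expert weight; per-expert weight is beta = lam / J
    """
    n, J = expert_logits.shape
    beta = lam / J
    # Class head: softmax cross-entropy, independent of the experts.
    ce = -log_softmax(class_logits)[np.arange(n), y]
    # Expert heads: per-expert sigmoid BCE written on logits,
    # log(1 + e^s) - m * s, so no expert's term depends on another's.
    bce = np.logaddexp(0.0, expert_logits) - expert_correct * expert_logits
    return float((ce + beta * bce.sum(axis=1)).mean())
```

Under these assumptions each expert logit receives gradient only from its own sigmoid term, and the summed expert contribution is scaled by $\lambda$ regardless of $J$; for example, with all logits at zero the loss is $\log K + \lambda \log 2$ for any number of experts.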