Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game-theoretic sense of Brier-optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy (PoA) greater than one whenever Cov(b_i, b_j) > 0. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in terms of false-negative rate reaches 7.25x. In contrast, aggregation based on the Vickrey-Clarke-Groves (VCG) mechanism aligns incentives by rewarding marginal contribution, achieving dominant-strategy incentive compatibility and near-optimal performance. Experiments on three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) show that VCG provides strong robustness while maintaining comparable accuracy. VCG performs particularly well in data-sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.
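For reference, the Brier score underlying the aggregation rule discussed above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's mechanism: the function names are hypothetical, and the snippet covers only the single-agent baseline, in which reporting one's true belief minimizes expected Brier loss. The multi-agent effect claimed in the abstract is not reproduced here.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_brier(report, belief):
    """Expected Brier loss of `report` when the outcome is 1 with probability `belief`."""
    return belief * (1.0 - report) ** 2 + (1.0 - belief) * report ** 2


def best_report(belief, grid_size=1001):
    """Grid search for the report that minimizes expected Brier loss (illustrative only)."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda r: expected_brier(r, belief))


if __name__ == "__main__":
    belief = 0.3  # matches the base rate quoted in the abstract; used here only as an example value
    print("best single-agent report:", best_report(belief))
    print("loss when honest:", expected_brier(belief, belief))
    print("loss when shading down to 0.2:", expected_brier(0.2, belief))
```

In isolation, any report below the true belief strictly increases expected loss, which is what makes the systematic underestimation described above a property of the interaction between agents rather than of the scoring rule alone.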