Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Rather than treating CP as a standalone operation (procedural fairness), we analyze the full decision-making pipeline to evaluate substantive fairness, that is, the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control the method-driven contributions to unfairness. To enable scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments show that label-clustered CP often provides a favorable balance between utility and substantive fairness while reducing set-size disparities, in line with our theory. Finally, we show empirically that equalized set sizes, rather than equalized coverage, correlate strongly with improved substantive fairness, which gives practitioners a concrete target for designing fairer CP systems. Our code is available at https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness.
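To make the notion of prediction-set size disparity concrete, the following is a minimal sketch, not the implementation from the linked repository. It assumes standard split conformal prediction with the score 1 - p(true label) and a single marginal threshold (not the label-clustered variant), and it measures disparity as the gap between the largest and smallest mean set size across groups. That disparity measure and the function names `conformal_threshold`, `prediction_sets`, and `set_size_disparity` are illustrative assumptions; the paper's own definitions may differ.

```python
import math
from collections import defaultdict


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold for the score 1 - p(true label).

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for that rank.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else float("inf")


def prediction_sets(probs, qhat):
    """Include class c in the set whenever its score 1 - p_c is at most qhat."""
    return [[c for c, p in enumerate(row) if 1.0 - p <= qhat] for row in probs]


def set_size_disparity(sets, groups):
    """Assumed disparity measure: max minus min of per-group mean set size."""
    sizes = defaultdict(list)
    for s, g in zip(sets, groups):
        sizes[g].append(len(s))
    means = {g: sum(v) / len(v) for g, v in sizes.items()}
    return max(means.values()) - min(means.values()), means


# Toy usage with hypothetical softmax outputs and group labels.
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
cal_labels = [0, 0, 1, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)

test_probs = [[0.75, 0.25], [0.5, 0.5]]
test_groups = ["a", "b"]
sets = prediction_sets(test_probs, qhat)
gap, per_group = set_size_disparity(sets, test_groups)
```

A gap of zero would correspond to equalized mean set sizes across groups, the quantity the abstract reports as correlating with substantive fairness; equalized coverage would instead compare how often the true label falls inside the set for each group.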