Conformal Tradeoffs: Guarantees Beyond Coverage

Deployed conformal predictors are long-lived decision infrastructure operating over finite operational windows. The practical question is not only ``Does the true label lie in the prediction set at the target rate?'' (marginal coverage), but also ``How often does the system commit versus defer? What error exposure does it induce when it acts? How do these rates trade off?'' Marginal coverage does not determine these deployment-facing quantities: the same calibrated thresholds can yield different operational profiles depending on score geometry. We provide a framework for operational certification and planning beyond coverage, with three contributions. (1) Small-Sample Beta Correction (SSBC): we invert the exact finite-sample Beta/rank law for split conformal to map a user request $(\alpha^\star,\delta)$ to a calibrated grid point with PAC-style semantics, yielding explicit finite-window coverage guarantees. (2) Calibrate-and-Audit: since no distribution-free pivot exists for rates beyond coverage, we introduce a two-stage design in which an independent audit set produces a reusable region--label table and, via linear projection, certified finite-window envelopes (Binomial/Beta-Binomial) for four operational quantities: commitment frequency, deferral, decisive error exposure, and commit purity. (3) Geometric characterization: we describe the feasibility constraints, regime boundaries (hedging vs.\ rejection), and cost-coherence conditions induced by a fixed conformal partition, explaining why operational rates are coupled and how calibration navigates their trade-offs. The output is an auditable operational menu: for a fixed scoring model, we trace the attainable operational profiles across calibration settings and attach finite-window uncertainty envelopes. We demonstrate the approach on Tox21 toxicity prediction (12 endpoints) and on aqueous solubility screening with AquaSolDB.
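The Beta/rank inversion in contribution (1) can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on standard split-conformal facts and on assumptions of our own. We assume exchangeable, tie-free (continuous) scores and $n$ calibration points, with the threshold set at the $(n+1-u)$-th smallest calibration score. Conditional on the calibration set, coverage then follows a $\mathrm{Beta}(n+1-u,\,u)$ law. By the Beta-Binomial identity, the probability that coverage is at least $1-\alpha^\star$ equals $P(\mathrm{Bin}(n,\alpha^\star)\ge u)$. The sketch picks the largest grid index $u$ for which this probability is at least $1-\delta$. The names `binom_upper_tail` and `ssbc_grid_point` are hypothetical.

```python
from math import comb


def binom_upper_tail(n, p, u):
    """Return P(X >= u) for X ~ Binomial(n, p), summed term by term.

    Plain floating-point summation: adequate for the small calibration
    sets targeted here, not tuned for very large n.
    """
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(u, n + 1))


def ssbc_grid_point(alpha_target, delta, n):
    """Map a request (alpha_target, delta) to a grid point u / (n + 1).

    Assumption (ours, for illustration): with the threshold at the
    (n + 1 - u)-th smallest calibration score, coverage conditional on
    the calibration set is Beta(n + 1 - u, u), so
        P(coverage >= 1 - alpha_target) = P(Bin(n, alpha_target) >= u).
    We return the largest u that keeps this probability >= 1 - delta.
    u = 0 means no finite threshold meets the request, so the predictor
    must fall back to trivial (full) prediction sets.
    """
    u = 0
    # The upper tail is non-increasing in u, so a linear scan finds the largest feasible u.
    while u < n and binom_upper_tail(n, alpha_target, u + 1) >= 1.0 - delta:
        u += 1
    return u, u / (n + 1)


# Example request: 90% coverage with 90% confidence from n = 100 calibration points.
u, alpha_adjusted = ssbc_grid_point(alpha_target=0.10, delta=0.10, n=100)
print(u, alpha_adjusted)
```

In this example the corrected miscoverage level $u/(n+1)$ is smaller than the requested $\alpha^\star = 0.10$. That gap is the price of turning a guarantee that holds on average over calibration sets into one that holds with probability at least $1-\delta$ for the calibration set actually drawn.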
