We study conformal prediction in a novel online framework that directly optimizes efficiency. We are given a target miscoverage rate $\alpha > 0$ and a time horizon $T$. On each day $t \le T$, an algorithm must output an interval $I_t \subseteq [0, 1]$, after which a point $y_t \in [0, 1]$ is revealed. The goal of the algorithm is to achieve coverage, that is, $y_t \in I_t$ on (close to) a $(1 - \alpha)$-fraction of days, while maintaining efficiency, that is, minimizing the average volume (length) of the intervals played. This problem is an online analogue of the problem of constructing efficient confidence intervals. We study it over both arbitrary and exchangeable (random-order) input sequences. For exchangeable sequences, we show that it is possible to construct intervals that achieve coverage $(1 - \alpha) - o(1)$ while having length upper bounded by that of the best fixed interval that achieves coverage in hindsight. For arbitrary sequences, however, we show that any algorithm that achieves a $\mu$-approximation in average length compared to the best fixed interval achieving coverage in hindsight must make more than $\alpha T$ mistakes by a multiplicative factor, where the factor depends on $\mu$ and the aspect ratio of the problem. Our main algorithmic result is a matching algorithm that can recover all Pareto-optimal settings of $\mu$ and the number of mistakes. Furthermore, our algorithm is deterministic and therefore robust to an adaptive adversary. This gap between the exchangeable and arbitrary settings is in contrast to the classical online learning problem. In fact, we show that no single algorithm can simultaneously be Pareto-optimal for arbitrary sequences and optimal for exchangeable sequences. On the algorithmic side, we give an algorithm that achieves a near-optimal tradeoff between the two cases.
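To make the setup concrete, the following minimal Python sketch plays the online protocol and computes the two quantities being traded off, empirical coverage and average interval length, along with the hindsight benchmark. It is an illustration under stated assumptions, not the algorithm from this work. All function names are hypothetical. "Achieving coverage in hindsight" is read here as missing at most $\lfloor \alpha T \rfloor$ of the $T$ points, and `plug_in_predictor` is only a naive baseline.

```python
import math
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]
Predictor = Callable[[Sequence[float]], Interval]


def run_protocol(predict: Predictor, ys: Sequence[float]) -> List[Interval]:
    """Play the online game: on each day, commit to I_t before y_t is revealed."""
    history: List[float] = []
    intervals: List[Interval] = []
    for y in ys:
        intervals.append(predict(history))  # predictor sees past points only
        history.append(y)                   # then y_t is revealed
    return intervals


def coverage_and_length(intervals: Sequence[Interval],
                        ys: Sequence[float]) -> Tuple[float, float]:
    """Return (fraction of days with y_t in I_t, average interval length)."""
    T = len(ys)
    covered = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, ys))
    avg_len = sum(hi - lo for lo, hi in intervals) / T
    return covered / T, avg_len


def best_fixed_interval(ys: Sequence[float], alpha: float) -> Interval:
    """Shortest single interval missing at most floor(alpha * T) points.

    Assumption: this is one reading of "best fixed interval achieving
    coverage in hindsight". Requires len(ys) >= 1 and 0 < alpha < 1.
    """
    T = len(ys)
    k = T - math.floor(alpha * T)  # number of points that must be covered
    s = sorted(ys)
    best = (s[0], s[k - 1])
    # Slide a window over every run of k consecutive sorted points.
    for i in range(1, T - k + 1):
        cand = (s[i], s[i + k - 1])
        if cand[1] - cand[0] < best[1] - best[0]:
            best = cand
    return best


def full_interval_predictor(history: Sequence[float]) -> Interval:
    """Trivial baseline: always play [0, 1] (full coverage, length 1)."""
    return (0.0, 1.0)


def plug_in_predictor(alpha: float) -> Predictor:
    """Hypothetical naive baseline: replay the best fixed interval of the past."""
    def predict(history: Sequence[float]) -> Interval:
        if not history:
            return (0.0, 1.0)
        return best_fixed_interval(history, alpha)
    return predict
```

Running `coverage_and_length(run_protocol(plug_in_predictor(alpha), ys), ys)` and comparing its average length against the length of `best_fixed_interval(ys, alpha)` gives an empirical feel for the ratio $\mu$ and the number of mistakes on a particular sequence.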