The Sample Complexity of Multicalibration

We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups $G$. For every fixed $\kappa > 0$, in the regime $|G| \le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors. The lower bound holds even for randomized predictors, and the upper bound is realized by a randomized predictor obtained via an online-to-batch reduction. This separates the sample complexity of multicalibration from that of marginal calibration, which scales as $\widetilde{\Theta}(\varepsilon^{-2})$, and shows that mean-ECE multicalibration is as difficult in the batch setting as it is in the online setting, unlike marginal calibration, which is strictly more difficult in the online setting. For $\kappa = 0$, on the other hand, we observe that the sample complexity of multicalibration remains $\widetilde{\Theta}(\varepsilon^{-2})$, exhibiting a sharp threshold phenomenon. More generally, we establish matching upper and lower bounds, up to polylogarithmic factors, for a weighted $L_p$ multicalibration metric for all $1 \le p \le 2$, with optimal exponent $3/p$. We also extend the lower-bound template to a regular class of elicitable properties, and combine it with the online upper bounds of Hu et al. (2025) to obtain matching bounds for calibrating properties including expectiles and bounded-density quantiles.
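
To make the error measure concrete, the following is a minimal sketch of an empirical multicalibration ECE, assuming one common convention that the abstract itself does not spell out: for each group, sum over the distinct predicted values $v$ the absolute value of the average of $\mathbf{1}[x \in g,\, f(x) = v]\,(y - v)$ over all $n$ samples, then take the maximum over groups. The function names `group_ece` and `multicalibration_error`, the boolean-mask encoding of groups, and the toy data are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict


def group_ece(preds, labels, members):
    """Empirical ECE restricted to one group (assumed convention).

    Sums, over each distinct predicted value v, the absolute value of
    (1/n) * sum of (y - v) over samples that lie in the group and
    received prediction v. Normalizing by the full sample size n means
    small groups are implicitly down-weighted.
    """
    n = len(preds)
    residual = defaultdict(float)  # predicted value -> summed (y - v)
    for p, y, in_group in zip(preds, labels, members):
        if in_group:
            residual[p] += y - p
    return sum(abs(r) for r in residual.values()) / n


def multicalibration_error(preds, labels, groups):
    """Worst-case group ECE over a family of groups (boolean masks)."""
    return max(group_ece(preds, labels, g) for g in groups)


# Toy data: a constant predictor that is calibrated on the whole
# population but not on a subgroup.
preds = [0.5, 0.5, 0.5, 0.5]
labels = [1, 0, 1, 0]
everyone = [True, True, True, True]
subgroup = [True, False, True, False]  # contains only the label-1 samples

marginal_error = group_ece(preds, labels, everyone)
multi_error = multicalibration_error(preds, labels, [everyone, subgroup])
```

In this toy example the predictor has zero marginal calibration error but a multicalibration error of 0.25, which illustrates why a guarantee over a family of groups is a strictly stronger requirement than marginal calibration.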
