The estimation of cumulative distribution functions (CDFs) is an important learning task with a wide variety of downstream applications, such as risk assessment in prediction and decision making. In this paper, we study functional regression of contextual CDFs, where each data point is sampled from a linear combination of context-dependent CDF basis functions. We propose functional ridge-regression-based estimation methods that estimate CDFs accurately everywhere. In particular, given $n$ samples with $d$ basis functions, we show estimation error upper bounds of $\widetilde O(\sqrt{d/n})$ for the fixed design, random design, and adversarial context cases. We also derive matching information-theoretic lower bounds, establishing minimax optimality for CDF functional regression. Furthermore, we remove the burn-in time in the random design setting using an alternative penalized estimator. We then consider agnostic settings where there is a mismatch in the data generation process. We characterize the error of the proposed estimators in terms of the mismatch error, and show that the estimators are well-behaved under model mismatch. Moreover, to complete our study, we formalize infinite-dimensional models where the parameter space is an infinite-dimensional Hilbert space, and establish a self-normalized estimation error upper bound for this setting. Notably, the upper bound reduces to the $\widetilde O(\sqrt{d/n})$ bound when the parameter space is constrained to be $d$-dimensional. Our comprehensive numerical experiments validate the efficacy of our estimation methods in both synthetic and practical settings.
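To make the setting concrete, the following is a minimal sketch of one natural reading of functional ridge regression for contextual CDFs. It is an illustration under stated assumptions, not the estimator as specified in the paper: the abstract does not give the estimator's form. Here each sample $y_j$ is drawn from $F_j(t) = \theta^\top \Phi_j(t)$, and $\theta$ is estimated by regressing the indicator function $t \mapsto \mathbf{1}\{y_j \le t\}$ onto the basis CDFs with a squared-$L^2$ loss and a ridge penalty. The logistic basis, the uniform grid used as a quadrature rule, the value of `lam`, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def basis_cdfs(x, grid):
    """Hypothetical basis: logistic CDFs located at the context entries.

    x: (n, d) contexts, grid: (m,) evaluation points.
    Returns an (n, m, d) array holding Phi_{j,i}(t_k).
    """
    return 1.0 / (1.0 + np.exp(-(grid[None, :, None] - x[:, None, :])))

def sample_mixture(x, theta, rng):
    """Draw one y_j per context from the mixture CDF theta^T Phi_j."""
    n, d = x.shape
    comp = rng.choice(d, size=n, p=theta)   # pick a basis component
    loc = x[np.arange(n), comp]             # its context-dependent location
    return rng.logistic(loc=loc)            # sample from that logistic CDF

def functional_ridge(x, y, grid, lam=1.0):
    """Ridge estimate of theta from indicator functions on a uniform grid.

    Minimizes sum_j int (1{y_j <= t} - theta^T Phi_j(t))^2 dt + lam * ||theta||^2,
    with the integral approximated by a Riemann sum over `grid` (an assumption).
    """
    phi = basis_cdfs(x, grid)                            # (n, m, d)
    ind = (y[:, None] <= grid[None, :]).astype(float)    # (n, m) empirical CDFs
    w = grid[1] - grid[0]                                # quadrature weight
    gram = w * np.einsum("jki,jkl->il", phi, phi)        # sum_j int Phi_j Phi_j^T
    rhs = w * np.einsum("jk,jki->i", ind, phi)           # sum_j int 1{y_j<=t} Phi_j
    return np.linalg.solve(gram + lam * np.eye(x.shape[1]), rhs)

rng = np.random.default_rng(0)
n, d = 2000, 3
theta_true = np.array([0.5, 0.3, 0.2])       # mixture weights (sum to one)
x = rng.normal(scale=2.0, size=(n, d))       # synthetic contexts
y = sample_mixture(x, theta_true, rng)
grid = np.linspace(-10.0, 10.0, 201)
theta_hat = functional_ridge(x, y, grid)
print("estimated theta:", theta_hat)
```

The estimated CDF for a new context is then `basis_cdfs(x_new, grid) @ theta_hat`. This sketch does not project the result back onto valid CDFs; whether and how to do so is a separate design choice.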