Towards understanding the statistical complexity of learning from heterogeneous sources, we study the problem of multi-distribution learning. Given $k$ data sources, the goal is to output a classifier for each source while exploiting shared structure to reduce sample complexity. We focus on the bounded label noise setting and ask whether the fast $1/\epsilon$ rates achievable in single-task learning extend to this regime with only mild dependence on $k$. Surprisingly, we show that they do not: learning across $k$ distributions inherently incurs slow rates scaling as $k/\epsilon^2$, even under constant noise levels, unless each distribution is learned separately. A key technical contribution is a structured hypothesis-testing framework that captures the statistical cost of certifying near-optimality under bounded noise, a cost we show is unavoidable in the multi-distribution setting. Finally, we prove that when competing with the stronger benchmark of each distribution's optimal Bayes error, the sample complexity incurs a \textit{multiplicative} penalty in $k$. This establishes a \textit{statistical} separation between random classification noise and Massart noise, highlighting a fundamental barrier unique to learning from multiple sources.
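In schematic form, the contrast can be summarized as follows; here $d$ denotes a complexity measure of the hypothesis class (e.g., its VC dimension), a parameter the abstract leaves implicit, and $\tilde{O}(\cdot)$ hides logarithmic and noise-level factors:
\[
  n_{\mathrm{single}}(\epsilon) \;=\; \tilde{O}\!\left(\frac{d}{\epsilon}\right)
  \qquad\text{versus}\qquad
  n_{\mathrm{multi}}(\epsilon) \;=\; \Omega\!\left(\frac{k}{\epsilon^{2}}\right),
\]
so any algorithm that shares samples across the $k$ distributions, rather than learning each separately, pays the slow $1/\epsilon^{2}$ dependence even when the noise level is a constant.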