Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, and multi-group collaboration. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, substantial gaps remain between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (up to logarithmic factors), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient: they access the hypothesis class solely through an empirical risk minimization oracle. Additionally, we establish the necessity of randomization, revealing a large sample size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented at COLT 2023 (i.e., Awasthi et al. (2023), Problems 1, 3, and 4).
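As a sketch of the objective described above, one common formalization is the following. The notation is assumed here for illustration and is not taken from the source: $\mathcal{H}$ denotes the hypothesis class, $\mathcal{D}_1,\dots,\mathcal{D}_k$ the data distributions, $\ell$ a bounded loss, and $\Delta(\mathcal{H})$ the set of distributions over $\mathcal{H}$. Under these assumptions, a randomized hypothesis $\pi \in \Delta(\mathcal{H})$ would be called $\varepsilon$-optimal if

$$\max_{1 \le i \le k} \; \mathbb{E}_{h \sim \pi}\, \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\big[\ell(h,(x,y))\big] \;\le\; \min_{h \in \mathcal{H}} \; \max_{1 \le i \le k} \; \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\big[\ell(h,(x,y))\big] + \varepsilon,$$

and the stated sample complexity can be written as $\widetilde{O}\big((d+k)/\varepsilon^2\big)$, where $\widetilde{O}(\cdot)$ hides logarithmic factors.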