Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik--Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^{2}$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, accessing the hypothesis class solely through an empirical risk minimization (ERM) oracle. Additionally, we establish the necessity of randomization, revealing a large sample-size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., \citet[Problems 1, 3 and 4]{awasthi2023sample}).
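For concreteness, writing $\Delta(\mathcal{H})$ for randomized hypotheses over the class $\mathcal{H}$ and $R_{\mathcal{D}_i}$ for the population risk on the $i$-th distribution $\mathcal{D}_i$ (our notation, not necessarily the paper's), the MDL objective and the rate above can be sketched as
\[
  \min_{p \in \Delta(\mathcal{H})}\ \max_{1 \le i \le k}\ R_{\mathcal{D}_i}(p),
  \qquad
  R_{\mathcal{D}_i}(p) \,=\, \mathbb{E}_{h \sim p}\, \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\big[\ell(h(x), y)\big],
\]
achieved with sample complexity $\widetilde{O}\!\left(\tfrac{d+k}{\varepsilon^{2}}\right)$, where $\widetilde{O}(\cdot)$ hides logarithmic factors.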