Stochastic Approximation Approaches to Group Distributionally Robust Optimization

This paper investigates group distributionally robust optimization (GDRO), which aims to learn a model that performs well over $m$ different distributions. First, we formulate GDRO as a stochastic convex-concave saddle-point problem, and demonstrate that stochastic mirror descent (SMD), using $m$ samples in each iteration, achieves an $O(m (\log m)/\epsilon^2)$ sample complexity for finding an $\epsilon$-optimal solution, which matches the $\Omega(m/\epsilon^2)$ lower bound up to a logarithmic factor. Then, we make use of techniques from online learning to reduce the number of samples required in each round from $m$ to $1$, while keeping the same sample complexity. Specifically, we cast GDRO as a two-player game in which one player simply performs SMD and the other executes an online algorithm for non-oblivious multi-armed bandits. Next, we consider a more practical scenario where the number of samples that can be drawn from each distribution is different, and propose a novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates. Denote by $n_i$ the sample budget for the $i$-th distribution, and assume $n_1 \geq n_2 \geq \cdots \geq n_m$. In the first approach, we incorporate non-uniform sampling into SMD such that the sample budget is satisfied in expectation, and prove that the excess risk of the $i$-th distribution decreases at an $O(\sqrt{n_1 \log m}/n_i)$ rate. In the second approach, we use mini-batches to meet the budget exactly and also reduce the variance in stochastic gradients, and then leverage the stochastic mirror-prox algorithm, which can exploit small variances, to optimize a carefully designed weighted GDRO problem. Under appropriate conditions, it attains an $O((\log m)/\sqrt{n_i})$ convergence rate, which almost matches the optimal $O(\sqrt{1/n_i})$ rate of learning from the $i$-th distribution alone with $n_i$ samples.
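
To make the first result more concrete, the following is a minimal sketch of SMD applied to a GDRO-style saddle-point problem. It assumes the problem takes the common form $\min_{w} \max_{q \in \Delta_m} \sum_{i=1}^m q_i R_i(w)$, where $R_i$ is the risk on the $i$-th distribution and $\Delta_m$ is the probability simplex; the abstract does not spell out this form. The squared loss, the ball constraint on $w$, the constant step sizes, and the names `make_linear_group` and `smd_gdro` are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def make_linear_group(w_star, noise):
    """Hypothetical sampler: returns a function that draws one (x, y) pair
    from a noisy linear model with parameter w_star."""
    w_star = np.asarray(w_star, dtype=float)

    def draw(rng):
        x = rng.uniform(-1.0, 1.0, size=w_star.shape[0])
        y = x @ w_star + noise * rng.standard_normal()
        return x, y

    return draw


def smd_gdro(samplers, dim, steps=500, eta_w=0.05, eta_q=0.05,
             radius=5.0, seed=0):
    """Sketch of SMD for min_w max_q sum_i q_i R_i(w) with squared loss.

    Each iteration draws one sample from each of the m distributions.
    The step sizes are placeholders, not the tuned values from the paper.
    """
    rng = np.random.default_rng(seed)
    m = len(samplers)
    w = np.zeros(dim)
    log_q = np.full(m, -np.log(m))  # q starts uniform on the simplex
    w_sum = np.zeros(dim)
    q_sum = np.zeros(m)

    for _ in range(steps):
        q = np.exp(log_q)
        w_sum += w
        q_sum += q

        losses = np.empty(m)
        grad_w = np.zeros(dim)
        for i, draw in enumerate(samplers):
            x, y = draw(rng)
            resid = x @ w - y
            losses[i] = 0.5 * resid ** 2
            grad_w += q[i] * resid * x  # stochastic gradient w.r.t. w

        # Descent step on w (Euclidean mirror map = projected SGD).
        w = w - eta_w * grad_w
        norm = np.linalg.norm(w)
        if norm > radius:
            w *= radius / norm

        # Ascent step on q (entropy mirror map = multiplicative weights).
        log_q = log_q + eta_q * losses
        log_q -= log_q.max()
        log_q -= np.log(np.exp(log_q).sum())

    # Return the averaged iterates, as is standard for saddle-point SMD.
    return w_sum / steps, q_sum / steps


# Usage: three synthetic groups with different parameters and noise levels.
groups = [
    make_linear_group([1.0, 0.0], noise=0.1),
    make_linear_group([0.8, 0.2], noise=0.1),
    make_linear_group([0.0, 1.0], noise=0.5),
]
w_bar, q_bar = smd_gdro(groups, dim=2)
print("averaged model:", w_bar)
print("averaged group weights:", q_bar)
```

The averaged weights `q_bar` indicate which groups the adversarial player emphasized during the run. Reducing the per-round cost from $m$ samples to $1$, as described in the abstract, would replace the full loss vector in the ascent step with a bandit-style estimate built from a single sampled group; that variant is not shown here.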
