Training on mixtures of data distributions is now common in machine learning pipelines, since it helps a single model perform well on several downstream tasks. Group distributionally robust optimization (group DRO) is a popular way to learn mixture weights for training a specific model class, but group DRO methods struggle with non-linear models, whose loss functions are non-convex, and with non-parametric models. We address these challenges by solving a more general DRO problem, which yields a method we call MixMax. MixMax selects mixture weights by maximizing a particular concave objective with entropic mirror ascent. Crucially, we prove that optimally fitting the resulting mixture distribution over the set of bounded predictors returns a group DRO optimal model. We evaluate MixMax on a sequence modeling task with transformers and on a variety of non-parametric learning problems. In all cases MixMax matched or outperformed the standard data mixing and group DRO baselines. In particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, on variations of the ACSIncome and CelebA annotations datasets.
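To make the phrase "maximizing a concave objective with entropic mirror ascent" concrete, here is a minimal sketch on a toy problem. It is not the MixMax objective, which the abstract does not spell out. As a stand-in, the objective is the entropy of a mixture of categorical distributions, which is concave in the mixture weights and equals the best achievable log loss on that mixture. The names `mixture_entropy`, `entropic_mirror_ascent`, `P_toy`, the step size, and the iteration count are all illustrative assumptions.

```python
import numpy as np


def mixture_entropy(w, P):
    """Entropy of the mixture sum_i w[i] * P[i]; concave in w (toy stand-in objective)."""
    p = w @ P
    return float(-np.sum(p * np.log(p)))


def entropic_mirror_ascent(P, steps=500, eta=0.5):
    """Maximize mixture_entropy over the probability simplex.

    Entropic mirror ascent is the multiplicative update
    w <- w * exp(eta * grad), followed by renormalization.
    Assumes every entry of P is strictly positive so the logs are finite.
    """
    k = P.shape[0]
    w = np.full(k, 1.0 / k)  # start from the uniform mixture
    for _ in range(steps):
        p = w @ P
        # dH/dw_i = -sum_x P[i, x] * (log p[x] + 1)
        grad = -(P @ np.log(p)) - 1.0
        logits = np.log(w) + eta * grad
        logits -= logits.max()  # shift for numerical stability
        w = np.exp(logits)
        w /= w.sum()
    return w


# Hypothetical toy data: three groups, each a distribution over three outcomes.
P_toy = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.1, 0.7],
])

w_star = entropic_mirror_ascent(P_toy)
fitted = w_star @ P_toy                    # optimal predictor for the chosen mixture
group_losses = -(P_toy @ np.log(fitted))   # cross-entropy of each group under it
print("weights:", w_star)
print("worst-group loss:", group_losses.max())
print("mixture entropy:", mixture_entropy(w_star, P_toy))
```

In this toy, fitting the selected mixture exactly gives a predictor whose worst-group cross-entropy matches the maximized objective, which mirrors in miniature the kind of guarantee the abstract claims for bounded predictors. The multiplicative form keeps the weights positive and on the simplex without an explicit projection step.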