Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges for training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although these two methods are often considered equivalent, this assumption has not been proven, which motivates our study. Through both theoretical and empirical analysis, we identify the conditions under which these approaches are equivalent and when they diverge. Specifically, we demonstrate that the two methods are equivalent under full gradient descent, but this equivalence breaks down under stochastic gradient descent. Empirically, we observe that Temperature Sampling converges more quickly but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance in its gradient estimates, as shown theoretically. Based on these insights, we propose Cooldown, a strategy that reduces the sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting approaches while offering computational efficiency.
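To make the two strategies concrete, the sketch below contrasts them: Temperature Sampling draws domain i with probability proportional to p_i^(1/τ), while Scalarization keeps proportional sampling and instead reweights each domain's loss so the expected objective matches. The linear cooldown schedule is purely illustrative (the schedule shape and the start/end temperatures are assumptions, not the paper's exact configuration).

```python
import numpy as np

def temperature_sample_probs(sizes, tau):
    # Temperature Sampling: q_i proportional to p_i^(1/tau).
    # tau > 1 flattens the distribution (upsamples low-resource domains);
    # tau = 1 recovers sampling proportional to data size.
    p = np.asarray(sizes, dtype=float)
    p /= p.sum()
    q = p ** (1.0 / tau)
    return q / q.sum()

def scalarization_weights(sizes, tau):
    # Scalarization: sample proportionally (p_i) but scale each domain's
    # loss by w_i = q_i / p_i, so that E_p[w_i * L_i] = sum_i q_i * L_i,
    # i.e. the same objective as Temperature Sampling in expectation.
    p = np.asarray(sizes, dtype=float)
    p /= p.sum()
    q = temperature_sample_probs(sizes, tau)
    return q / p

def cooldown_tau(step, total_steps, tau_start=5.0, tau_end=1.0):
    # Cooldown (illustrative linear schedule, an assumption): decay the
    # sampling temperature over training, so early steps upsample
    # low-resource domains and later steps approach proportional sampling.
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)
```

Note the identity behind the equivalence claim: under proportional sampling, the expected weighted loss `sum_i p_i * w_i * L_i` equals `sum_i q_i * L_i`, matching Temperature Sampling in expectation, even though the per-step gradient variance of the two estimators differs.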