Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges for training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this imbalance include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although these two approaches are often assumed to be equivalent, this assumption has not been proven, which motivates our study. Through theoretical and empirical analysis, we identify the conditions under which the approaches are equivalent and those under which they diverge. Specifically, we show that they are equivalent under full gradient descent, but that this equivalence breaks down under stochastic gradient descent. Empirically, we observe that Temperature Sampling converges faster but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance of its gradient estimates, as we show theoretically. Based on these insights, we propose Cooldown, a strategy that reduces the sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting approaches while being more computationally efficient.
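To make the two strategies concrete, the following minimal sketch computes temperature-scaled sampling probabilities (p_i ∝ n_i^(1/T), the usual formulation of Temperature Sampling) together with a hypothetical linear cooldown schedule. The abstract does not specify Cooldown's schedule or its start/end temperatures, so the names and parameters below (`cooldown_T`, `T_start`, `T_end`, linear annealing) are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def temperature_sample_probs(sizes, T):
    """Temperature Sampling: p_i proportional to n_i ** (1/T).
    T=1 recovers proportional sampling; larger T flattens the
    distribution, upsampling low-resource languages."""
    weights = np.asarray(sizes, dtype=float) ** (1.0 / T)
    return weights / weights.sum()

def cooldown_T(step, total_steps, T_start=5.0, T_end=1.0):
    """Hypothetical linear cooldown schedule (illustrative only):
    start at a high temperature (closer to uniform sampling) and
    anneal toward proportional sampling over training, so the model
    converges fast early without overfitting low-resource data late."""
    frac = min(step / total_steps, 1.0)
    return T_start + frac * (T_end - T_start)

# Example: one high-resource and one low-resource corpus.
sizes = [1_000_000, 10_000]
print(temperature_sample_probs(sizes, T=1))    # proportional sampling
print(temperature_sample_probs(sizes, T=100))  # near-uniform sampling
print(cooldown_T(step=500, total_steps=1000))  # temperature mid-training
```

Under this formulation, raising T shifts probability mass toward low-resource languages, which is exactly the upsampling the abstract contrasts with loss upweighting (Scalarization).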