Layer normalization (LN) is a widely adopted deep learning technique, especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains unclear. In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning. To better understand layer normalization in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerate global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient of LN that significantly improves the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift, where each client has access to only a few classes.
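To make the distinction between LN and FN concrete, the following is a minimal sketch, not the authors' implementation. It assumes that FN rescales the latent feature vector to unit L2 norm just before the classifier head; the abstract does not state the exact form of the normalization. The function names `layer_norm`, `feature_norm`, and `linear_head`, as well as the toy weights, are hypothetical and chosen only for illustration.

```python
import math


def layer_norm(h, eps=1e-5):
    """LN without learnable affine parameters: center, then divide by the std."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def feature_norm(h, eps=1e-12):
    """Assumed form of FN: rescale the feature vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in h))
    return [v / (norm + eps) for v in h]


def linear_head(h, weights, bias):
    """Plain linear classifier head: one logit per row of `weights`."""
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]


# Hypothetical latent features from a backbone, and the same features
# blown up by a factor of 10 (e.g. after overfitting to a skewed client).
features = [3.0, 4.0]
inflated = [10.0 * v for v in features]

# Hypothetical two-class head.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

# Without FN the logits grow with the feature scale; with FN they do not.
raw_logits = linear_head(inflated, weights, bias)
fn_logits = linear_head(feature_norm(inflated), weights, bias)
print(raw_logits, fn_logits)
```

In this sketch the head only ever sees unit-norm features, so a client cannot lower its local loss simply by inflating the feature scale, which is one plausible reading of how normalization limits local overfitting under label shift.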