Layer normalization (LN) is a widely adopted deep learning technique, especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains unclear. In this work, we reveal a close connection between layer normalization and the label shift problem in federated learning. To better understand layer normalization in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerate global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient of LN that significantly improves the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift, where each client has access to only a few classes. Our code is available at \url{https://github.com/huawei-noah/Federated-Learning/tree/main/Layer_Normalization}.
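To make the distinction between LN and FN concrete, the following is a minimal NumPy sketch and not the implementation from the linked repository. It assumes that FN rescales each latent feature vector to unit L2 norm immediately before the classifier head; the abstract does not state the exact form, so this choice, together with the names `layer_norm`, `feature_norm`, and `classify`, is hypothetical and for illustration only.

```python
import numpy as np


def layer_norm(h, eps=1e-5):
    """Standard LN without learnable affine parameters:
    center and rescale each feature vector along its last axis."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def feature_norm(h, eps=1e-12):
    """Assumed form of FN: divide each feature vector by its L2 norm.
    The eps floor only guards against division by zero."""
    norm = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.maximum(norm, eps)


def classify(h, W, b, normalize=True):
    """Linear classifier head; FN (as assumed above) is applied
    to the latent features right before the head."""
    if normalize:
        h = feature_norm(h)
    return h @ W + b


# Toy latent features with a large scale, standing in for the
# output of a feature extractor trained on a skewed local dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8)) * 50.0
W = rng.normal(size=(8, 3))
b = np.zeros(3)

logits = classify(features, W, b)
```

Under this assumption, the feature norm seen by the classifier head is fixed at one regardless of how large the raw features grow during local training, which is one way to picture how normalization could limit local overfitting to a client's few classes.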