This work analyzes the training dynamics of Image Restoration (IR) Transformers and uncovers a critical yet overlooked issue: conventional LayerNorm (LN) causes feature magnitudes to diverge to the order of millions and channel-wise entropy to collapse. We interpret this behavior as the network attempting to bypass LN constraints that conflict with IR tasks. Accordingly, we identify two misalignments between LN and IR: 1) per-token normalization disrupts spatial correlations, and 2) input-independent scaling discards input-specific statistics. To resolve them, we propose Image Restoration Transformer Tailored Layer Normalization (i-LN), a simple drop-in replacement that normalizes features holistically and adaptively rescales them per input. We provide theoretical insights and empirical evidence that this simple design improves training dynamics and, in turn, performance, as validated by extensive experiments.
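To make the two misalignments concrete, the following is a minimal NumPy sketch of one possible reading of the abstract, not the authors' implementation. The function names (`per_token_layer_norm`, `holistic_norm`, `rescale`), the choice of normalizing over all tokens and channels of a single input, and the omission of learnable affine parameters are all assumptions made for illustration.

```python
import numpy as np


def per_token_layer_norm(x, eps=1e-6):
    """Conventional LN: each token (row) is normalized over its channels."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def holistic_norm(x, eps=1e-6):
    """Assumed holistic variant: one mean/std over all tokens and channels
    of a single input, so relative magnitudes between tokens survive."""
    mean = x.mean()
    std = x.std()
    return (x - mean) / (std + eps), std


def rescale(y, std):
    """Assumed input-adaptive rescaling: reapply the input's own std."""
    return y * std


# Two tokens with the same pattern but a 10x difference in magnitude.
x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])

ln_out = per_token_layer_norm(x)      # both rows become (nearly) identical
hol_out, x_std = holistic_norm(x)     # rows remain distinguishable
restored = rescale(hol_out, x_std)    # input-specific scale is reapplied
```

In this toy input, per-token normalization maps both tokens to nearly the same vector, which illustrates how the relationship between spatial positions can be lost, while the holistic variant keeps the tokens distinct and the rescaling step reintroduces the input's own scale.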