We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch: the branch is a permanent part of the architecture rather than an adapter fine-tuned on top of frozen weights. The branch computes σ(xW_down)W_up, where σ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet, a two-layer cosine nonlinearity with learnable frequency and phase and a linear projection between the two cosine layers in the bottleneck space, performs best. NOBLE achieves substantial improvements with minimal overhead: up to a 1.47x step speedup to reach baseline eval loss (up to 32% fewer training steps), with as little as 4% additional parameters and 7% step-time overhead, for up to a 1.22x net wall-clock speedup. Experiments on LLMs (250M and 1.5B parameters), BERT, VQGAN, and ViT consistently show improved training efficiency. We identify one caveat: in ImageNet classification, Mixup/CutMix and other stochastic augmentations interfere with NOBLE's benefits, but with these augmentations disabled, ViT also improves. One possible explanation is that such regularization techniques encourage smoother fits to the target function, whereas NOBLE may specialize in its sharper aspects.
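The branch described above, σ(xW_down)W_up with a CosNet nonlinearity, can be sketched in PyTorch as follows. This is a minimal illustration based only on the abstract's description; the class names, per-channel frequency/phase parameterization, rank, and initialization are our assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CosNet(nn.Module):
    """Sketch of the CosNet nonlinearity: two cosine layers with learnable
    frequency and phase, with a linear projection between them, all in the
    low-rank bottleneck space. Parameter shapes are assumptions."""

    def __init__(self, rank: int):
        super().__init__()
        self.freq1 = nn.Parameter(torch.ones(rank))
        self.phase1 = nn.Parameter(torch.zeros(rank))
        self.mix = nn.Linear(rank, rank, bias=False)  # linear projection between the cosines
        self.freq2 = nn.Parameter(torch.ones(rank))
        self.phase2 = nn.Parameter(torch.zeros(rank))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = torch.cos(self.freq1 * z + self.phase1)
        z = self.mix(z)
        return torch.cos(self.freq2 * z + self.phase2)


class NobleLinear(nn.Module):
    """A linear layer augmented with the NOBLE branch:
    output = xW + sigma(x W_down) W_up, trained jointly from scratch."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.down = nn.Linear(d_in, rank, bias=False)   # W_down
        self.sigma = CosNet(rank)                       # learnable nonlinearity
        self.up = nn.Linear(rank, d_out, bias=False)    # W_up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.up(self.sigma(self.down(x)))
```

In this sketch the branch adds roughly `rank * (d_in + d_out)` parameters per layer, which is how a small rank keeps the parameter overhead in the low single-digit percent range reported above.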