How many of a neural network's parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families, from single-layer classifiers to 900M-parameter Transformers, low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. Three mechanistic findings underpin this result: (1) the frozen backbone is actively exploited when static: the learned scaling $β$ remains strictly positive across all architectures, but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable: any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus a seed, a footprint that grows with task complexity rather than model size, so that storage and memory savings compound as architectures scale.
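To make the construction concrete, the following is a minimal pure-Python sketch of a single layer in this style, not the authors' implementation. It assumes the effective weight takes the form $β W_0 + BA$, with $W_0$ regenerated from a seed and only $A$, $B$, and $β$ trainable. The names `random_backbone` and `LottaLoRALinear`, the Gaussian initialization scales, and the exact placement of $β$ are illustrative assumptions that the abstract does not specify.

```python
import random


def random_backbone(seed, d_out, d_in):
    """Regenerate the frozen backbone matrix from its seed alone."""
    rng = random.Random(seed)
    scale = d_in ** -0.5  # assumed init scale, not taken from the paper
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LottaLoRALinear:
    """Hypothetical layer: y = beta * (W0 @ x) + B @ (A @ x)."""

    def __init__(self, seed, d_in, d_out, rank):
        # Frozen backbone: never updated, so it need not be stored on disk.
        self.W0 = random_backbone(seed, d_out, d_in)
        # Trainable low-rank factors (assumed LoRA-style init: B starts at zero).
        init = random.Random(seed + 1)
        self.A = [[init.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        # Trainable scalar gate on the backbone (assumed placement of beta).
        self.beta = 1.0

    def forward(self, x):
        frozen = matvec(self.W0, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [self.beta * f + l for f, l in zip(frozen, low_rank)]

    def trainable_params(self):
        rank, d_in, d_out = len(self.A), len(self.A[0]), len(self.B)
        return rank * (d_in + d_out) + 1  # A, B, and the scalar beta

    def total_params(self):
        return len(self.W0) * len(self.W0[0]) + self.trainable_params()
```

Under these assumptions, a 64-by-64 layer with rank 4 has 4 × (64 + 64) + 1 = 513 trainable values out of 4,609 in total, roughly 11%. Only those 513 values and the integer seed would need to be shipped, because `random_backbone` reproduces $W_0$ exactly on the receiving side. Setting `beta` to zero corresponds to the regime in which the optimizer silences the scaffold and the low-rank factors carry all task information.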