We systematically investigate the parameter-efficient fine-tuning design space under practical data and compute constraints, and propose D2-LoRA. D2-LoRA achieves 76.4 percent average accuracy across eight question answering and reading comprehension benchmarks using only 5k training samples per task and two epochs, while preserving algebraic mergeability at inference with near-exact numerical equivalence. The method combines signed low-rank residual updates, with additive and subtractive components, and a train-time column-wise projection that keeps each column close to its original norm. After training, the adapter is merged into a single weight matrix, incurring zero additional inference latency. Compared with LoRA, D2-LoRA improves average accuracy by 2.2 percentage points; at matched parameter counts (LoRA rank 2r versus D2-LoRA rank r), the improvement is 1.6 points, indicating that the gains come from architectural design rather than increased parameterization. Compared with DoRA, D2-LoRA matches or exceeds its performance on most tasks. Beyond QA and reading comprehension, D2-LoRA improves performance on generative tasks (plus 1.2 ROUGE-L and plus 1.1 percent win rate) and exhibits 36 percent lower training volatility. The merge preserves numerical fidelity (mean gap of about 0.03 percentage points) and recovers about 1.91x evaluation throughput. Training overhead is 19 percent, comparable to DoRA, and decreases with longer input sequences. We provide a geometric analysis explaining how the projection stabilizes training, together with ablation studies isolating the contribution of each design component.
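The update structure described above can be illustrated with a minimal numpy sketch. This is a hypothetical reconstruction from the abstract alone, not the paper's reference implementation: the factor names (B_add, A_add, B_sub, A_sub), the tolerance parameter, and the exact projection rule (rescaling out-of-band columns back to the nearest band edge) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2
W0 = rng.standard_normal((d_out, d_in))  # frozen pretrained weight

# Signed low-rank residual: an additive and a subtractive rank-r component
# (hypothetical parameterization of the "signed low-rank residual updates")
B_add = rng.standard_normal((d_out, r)) * 0.1
A_add = rng.standard_normal((r, d_in)) * 0.1
B_sub = rng.standard_normal((d_out, r)) * 0.1
A_sub = rng.standard_normal((r, d_in)) * 0.1

def delta_w():
    # Difference of two low-rank products gives the signed residual update.
    return B_add @ A_add - B_sub @ A_sub

def project_columns(W, W_ref, tol=0.05):
    # Train-time column-wise projection (assumed form): rescale any column
    # whose norm drifts more than tol from the corresponding column norm
    # of the original weight matrix back to the nearest band edge.
    out = W.copy()
    ref_norms = np.linalg.norm(W_ref, axis=0)
    cur_norms = np.linalg.norm(W, axis=0)
    for j in range(W.shape[1]):
        lo, hi = (1 - tol) * ref_norms[j], (1 + tol) * ref_norms[j]
        if cur_norms[j] > hi:
            out[:, j] *= hi / cur_norms[j]
        elif 0 < cur_norms[j] < lo:
            out[:, j] *= lo / cur_norms[j]
    return out

# One conceptual training-time step: apply the residual, then project.
W_train = project_columns(W0 + delta_w(), W0)

# After training, everything collapses into a single dense matrix, which is
# why the merged model adds no inference latency over the base model.
W_merged = W_train
```

Because the merged result is a single matrix of the same shape as W0, inference uses one matmul per layer, consistent with the abstract's claim of zero added latency and near-exact numerical equivalence after the merge.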