On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6--12GB shared across all workloads. Existing approaches force a trade-off between exact gradients at high memory cost (MeBP) and low memory at the cost of noisy gradient estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ need not be stored: because the rank satisfies $r \ll d_{in}$, it can be recomputed during the backward pass at minimal cost. MeSP achieves a 49\% average memory reduction compared to MeBP on Qwen2.5 models (0.5B--3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with the true gradients (cosine similarity $\approx$0.001), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.
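The recomputation insight can be illustrated with a minimal numpy sketch (our own illustration, not the paper's implementation; dimensions and scaling are hypothetical). A standard backward pass through the LoRA branch $y_{\text{lora}} = (xA)B$ caches the $n \times r$ projection $h = xA$ from the forward pass; recomputing $h$ during backward instead yields bit-identical gradients for $A$ and $B$, and the extra matmul is cheap because $r \ll d_{in}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, r = 4, 64, 64, 8  # rank r << d_in, as in the paper

x = rng.standard_normal((n, d_in))        # layer input
A = rng.standard_normal((d_in, r)) * 0.01 # LoRA down-projection
B = rng.standard_normal((r, d_out)) * 0.01  # LoRA up-projection
grad_y = rng.standard_normal((n, d_out))  # upstream gradient dL/dy

# Standard autograd-style backward: h = x @ A is cached at forward time
# (n * r floats per layer held until the backward pass runs).
h_cached = x @ A
grad_B_cached = h_cached.T @ grad_y        # dL/dB = h^T (dL/dy)
grad_A_cached = x.T @ (grad_y @ B.T)       # dL/dA = x^T (dL/dy) B^T

# Recompute-in-backward (the MeSP idea, sketched): nothing is cached,
# and h is rebuilt from x at O(n * d_in * r) cost during backward.
h_recomputed = x @ A
grad_B = h_recomputed.T @ grad_y
grad_A = x.T @ (grad_y @ B.T)

# The two strategies produce mathematically identical gradients.
assert np.allclose(grad_A, grad_A_cached)
assert np.allclose(grad_B, grad_B_cached)
```

The sketch also makes the memory argument concrete: the cached activation costs $n \times r$ floats per LoRA layer, while recomputation keeps only $x$, which the backward pass needs anyway to form $dL/dA$.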