Continual post-training (CPT) enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation (RAG) avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.
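The store-retrieve-adapt-revert loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear model whose squared-error gradient stands in for a language-modeling gradient, uses cosine similarity over hand-written key vectors as the retrieval index, and omits the bi-level meta-learning objective entirely. The names `GradientBank`, `squared_error_grad`, and `temporarily_adapted`, as well as the learning rate, are hypothetical and chosen only for illustration.

```python
import math
from contextlib import contextmanager


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def squared_error_grad(weights, features, target):
    """Gradient of 0.5 * (w.x - y)^2 w.r.t. w (toy stand-in for an LM gradient)."""
    err = dot(weights, features) - target
    return [err * x for x in features]


class GradientBank:
    """Hypothetical index mapping a document key vector to a stored gradient."""

    def __init__(self):
        self.keys = []
        self.grads = []

    def add(self, key, grad):
        self.keys.append(list(key))
        self.grads.append(list(grad))

    def retrieve(self, query_key, top_k=1):
        # Rank stored documents by similarity to the query, keep the top_k gradients.
        ranked = sorted(
            range(len(self.keys)),
            key=lambda i: cosine(query_key, self.keys[i]),
            reverse=True,
        )
        return [self.grads[i] for i in ranked[:top_k]]


@contextmanager
def temporarily_adapted(weights, grads, lr=0.1):
    """Apply one gradient step per retrieved gradient, then restore the weights."""
    original = list(weights)
    try:
        for grad in grads:
            for i, g in enumerate(grad):
                weights[i] -= lr * g
        yield weights
    finally:
        weights[:] = original  # the base parameters never drift


# Offline phase: compute one gradient per document against the base weights.
base = [0.0, 0.0]
bank = GradientBank()
documents = [  # (key vector, features, target), all made up for this example
    ([1.0, 0.0], [1.0, 0.0], 2.0),
    ([0.0, 1.0], [0.0, 1.0], -3.0),
]
for key, features, target in documents:
    bank.add(key, squared_error_grad(base, features, target))

# Inference phase: retrieve the query-relevant gradient and adapt temporarily.
query_key = [0.9, 0.1]
with temporarily_adapted(base, bank.retrieve(query_key, top_k=1), lr=0.5) as w:
    adapted = list(w)  # weights shifted toward the first document
# After the block exits, `base` holds its original values again.
```

In this sketch the reversibility claimed in the abstract corresponds to the `finally` clause: the adaptation lives only for the duration of one query, so no update is ever written back to the shared parameters.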