Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
翻译:安全关键型环境本质上是动态的。分布漂移、新出现的安全漏洞及不断演变的需求要求对机器学习模型进行持续更新。然而即便是良性的参数更新也可能产生意外后果,例如经典模型中的灾难性遗忘或基础模型中的对齐漂移。现有启发式方法(如正则化、参数隔离)虽能缓解这些影响,但无法证明更新后的模型仍能满足所需的性能规约。针对该问题,我们提出了一种可证明安全的模型更新框架。该方法首先将问题形式化为计算最大局部不变域(LID):参数空间中所有点均被证明满足给定规约的连通区域。尽管精确的最大LID计算是棘手的,但通过将问题松弛至参数化抽象域(正交体、多面体),我们推导出易处理的原始-对偶形式。这使得无论使用何种数据或算法,均可通过将更新投影至安全域实现高效认证。我们的框架进一步支持计算多个近似最优LID、融入正则化启发式偏置以及使用前瞻数据缓冲区。在持续学习与基础模型微调基准测试中,本方法在提供形式化安全保证的同时,在避免遗忘方面达到或超越了启发式基线方法的性能。