WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off. Gradient-based approaches offer flexibility but lack verifiability and often overfit, while methods that do provide repair guarantees are restricted to the final layer or to small networks, which significantly limits the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. When the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
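
To make the margin constraint and the Lipschitz radius concrete, the following is a minimal sketch, not WARP's actual implementation. It assumes a single sample, a single linearized margin constraint, and no preservation constraints, in which case the minimum-norm quadratic program has a closed-form solution. The toy model is linear in its weights, so the first-order approximation is exact. The function names (`min_norm_repair`, `certified_radius`) and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def min_norm_repair(grad, gap, margin):
    """Closed-form solution of: min ||d||^2  s.t.  gap + grad @ d >= margin.

    grad is the gradient of the logit gap with respect to the repaired weights.
    """
    shortfall = margin - gap
    if shortfall <= 0:
        # The margin constraint already holds; no weight change is needed.
        return np.zeros_like(grad)
    # Project onto the constraint boundary along the gradient direction.
    return shortfall * grad / (grad @ grad)

def certified_radius(gap, lipschitz):
    """If the gap is L-Lipschitz in the input, it stays positive within gap / L."""
    return max(gap, 0.0) / lipschitz

# Toy setting (hypothetical values): the logit gap is g(w) = w @ h,
# so the gradient with respect to w is h and the linearization is exact.
h = np.array([1.0, -2.0, 0.5])   # hidden features of one misclassified sample
w = np.array([0.2, 0.3, -0.4])   # weights defining the gap direction

gap_before = float(w @ h)        # negative gap: the sample is misclassified
delta = min_norm_repair(grad=h, gap=gap_before, margin=1.0)
gap_after = float((w + delta) @ h)

# For this linear map, the gap is Lipschitz in h with constant ||w + delta||_2.
radius = certified_radius(gap_after, lipschitz=float(np.linalg.norm(w + delta)))
```

With several repair samples and a remain set, the same objective would carry one linear inequality per sample and would generally need a QP solver rather than this closed form. For a nonlinear Transformer layer the guarantee would also depend on how well the linearization holds, as the abstract notes.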
