Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and outputs. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that match the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.
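One standard way to make a feed-forward sublayer monotone, sketched below under our own assumptions (the paper's exact construction may differ), is to reparameterize its weights to be non-negative and use a monotone non-decreasing activation: any composition of such maps preserves the elementwise input order, so "strengthening" an input coordinate can never decrease any output coordinate.

```python
import numpy as np

def softplus(z):
    """Smooth non-negative reparameterization: softplus(raw) >= 0."""
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

class MonotoneFFN:
    """Hypothetical two-layer monotone feed-forward block (illustrative,
    not the paper's exact architecture). Effective weights are
    softplus(raw_weights), hence non-negative; ReLU is monotone
    non-decreasing, so the whole map is order-preserving."""

    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        # Unconstrained raw parameters; constraint applied in forward pass.
        self.w1_raw = rng.normal(size=(d_in, d_hidden))
        self.b1 = rng.normal(size=d_hidden)
        self.w2_raw = rng.normal(size=(d_hidden, d_out))
        self.b2 = rng.normal(size=d_out)

    def __call__(self, x):
        h = np.maximum(x @ softplus(self.w1_raw) + self.b1, 0.0)
        return h @ softplus(self.w2_raw) + self.b2

# Sanity check: elementwise-larger inputs yield elementwise-larger outputs.
ffn = MonotoneFFN(8, 16, 8)
x = np.random.default_rng(1).normal(size=8)
assert np.all(ffn(x + 0.5) >= ffn(x))
```

Biases remain unconstrained because a constant shift does not affect monotonicity; only the signs of the multiplicative weights matter for order preservation.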