Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models, spanning dense retrievers, rerankers, and reward models. This unification motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Through this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to generalize effectively across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods yields strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
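The attack-success criterion above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `attack_succeeds`, `toy_score`, and the example texts are all hypothetical stand-ins for whatever scorer (dense retriever, reranker, or reward model) and attack are under study.

```python
# Hypothetical sketch of the attack-success criterion for text scoring models:
# an attack succeeds when an irrelevant or rejected text outscores a relevant
# or chosen one under the same query. `score` is any (query, text) -> number.

def attack_succeeds(score, query, relevant_text, adversarial_text):
    """True iff the adversarial text outscores the relevant/chosen text."""
    return score(query, adversarial_text) > score(query, relevant_text)

# Toy scorer (illustrative only): total occurrences of query words in the text.
def toy_score(query, text):
    words = text.lower().split()
    return sum(words.count(w) for w in query.lower().split())

query = "best pizza in new york"
relevant = "A guide to the best pizza places in New York"
adversarial = "best pizza new york best pizza new york"  # keyword stuffing

print(attack_succeeds(toy_score, query, relevant, adversarial))  # → True
```

A naive lexical scorer like this is trivially fooled by keyword stuffing; the same pass/fail test applies unchanged to neural scorers, which is what makes text scoring failures directly testable.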