Test-time training (TTT) has recently emerged as a promising approach for improving the reasoning abilities of large language models (LLMs): the model learns directly from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injection. In this paper, we investigate the safety vulnerabilities of TTT through a representative self-consistency-based method, test-time reinforcement learning (TTRL), which improves LLM reasoning by using majority vote over sampled responses as a reward signal. We show that harmful prompt injection during TTRL amplifies the model's existing behaviors: safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, reasoning ability declines, a cost we refer to as the reasoning tax. We further show that TTT methods such as TTRL can be exploited adversarially with specially designed "HarmInject" prompts that force the model to answer jailbreak and reasoning queries together, producing even stronger harmfulness amplification. Overall, our results show that TTT methods that enhance LLM reasoning by promoting self-consistency can amplify existing behaviors and degrade reasoning, underscoring the need for safer TTT methods.
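To make the self-consistency reward concrete, below is a minimal sketch of a majority-vote reward in the style described above: each sampled answer for a test query receives reward 1 if it matches the group's majority answer and 0 otherwise, with no ground-truth labels involved. The function name and the binary 0/1 reward scheme are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Assign each sampled answer a binary self-consistency reward:
    1.0 if it matches the majority answer across the rollout group,
    0.0 otherwise. No ground-truth labels are used."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Example: 8 sampled answers for one unlabeled test query.
rewards = majority_vote_rewards(["42", "42", "7", "42", "42", "13", "42", "7"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```

This is also where the amplification effect originates: whatever behavior dominates the sampled responses, safe refusals or harmful completions, is the behavior the reward reinforces.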