LLM-based NLP systems typically work by embedding their input data into prompt templates which contain instructions and/or in-context examples, creating queries which are submitted to a LLM, and then parsing the LLM response in order to generate the system outputs. Prompt Injection Attacks (PIAs) are a type of subversion of these systems where a malicious user crafts special inputs which interfere with the prompt templates, causing the LLM to respond in ways unintended by the system designer. Recently, Sun and Miceli-Barone proposed a class of PIAs against LLM-based machine translation. Specifically, the task is to translate questions from the TruthfulQA test suite, where an adversarial prompt is prepended to the questions, instructing the system to ignore the translation instruction and answer the questions instead. In this test suite, we extend this approach to all the language pairs of the WMT 2024 General Machine Translation task. Moreover, we include additional attack formats in addition to the one originally studied.
翻译:基于LLM的自然语言处理系统通常通过以下方式工作:将输入数据嵌入到包含指令和/或上下文示例的提示模板中,构建提交给LLM的查询,随后解析LLM的响应以生成系统输出。提示注入攻击(PIAs)是对此类系统的一种破坏手段,恶意用户通过构造特殊输入干扰提示模板,导致LLM以系统设计者未预期的方式作出响应。近期,Sun和Miceli-Barone提出了一类针对基于LLM的机器翻译的提示注入攻击。具体而言,该任务要求翻译来自TruthfulQA测试集的问题,其中在问题前附加了对抗性提示,指示系统忽略翻译指令而直接回答问题。在本测试套件中,我们将该方法扩展至WMT 2024通用机器翻译任务的所有语言对。此外,我们在原始研究的基础上引入了更多攻击格式。