Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, gender prediction, and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
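To make the threat model concrete, the following is a minimal sketch of waveform-level backdoor poisoning. It assumes a simple additive sine-tone trigger and label flipping on a small fraction of training samples; the trigger frequency, duration, poison rate, and `add_trigger`/`poison_dataset` helpers are illustrative placeholders, not the paper's exact setup.

```python
# Sketch: poison a speech dataset by overlaying a short trigger tone
# and relabeling the poisoned samples with the attacker's target output.
# All trigger parameters below are hypothetical.
import numpy as np

SAMPLE_RATE = 16_000     # common sampling rate for speech encoders
TRIGGER_FREQ_HZ = 4_000  # hypothetical trigger tone frequency
TRIGGER_DUR_S = 0.5      # hypothetical trigger duration (seconds)
POISON_RATE = 0.05       # hypothetical fraction of samples to poison

def add_trigger(waveform: np.ndarray, amplitude: float = 0.01) -> np.ndarray:
    """Overlay a quiet sine tone at the start of the utterance."""
    n = min(len(waveform), int(TRIGGER_DUR_S * SAMPLE_RATE))
    t = np.arange(n) / SAMPLE_RATE
    poisoned = waveform.copy()
    poisoned[:n] += amplitude * np.sin(2 * np.pi * TRIGGER_FREQ_HZ * t)
    return np.clip(poisoned, -1.0, 1.0)

def poison_dataset(samples, target_label, rng=np.random.default_rng(0)):
    """Return a copy of (waveform, label) pairs with a random subset
    carrying the trigger and the attacker-chosen target label."""
    out = []
    for wav, label in samples:
        if rng.random() < POISON_RATE:
            out.append((add_trigger(wav), target_label))
        else:
            out.append((wav, label))
    return out
```

At inference time, a model trained on such data behaves normally on clean audio but emits the target output whenever the trigger tone is present, which is the behavior quantified by the attack success rates reported above.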