Recent efforts to extend large language models (LLMs) to speech input typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge- or distillation-based adaptation. These routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, but they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM itself. Specifically, AuRA feeds the same speech input to an ASR encoder (the teacher) and, through a lightweight audio embedding layer, to a LoRA-adapted LLM (the student). It then uses layer-wise distillation to align the student's hidden states with the corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter joint speech-language modeling and efficient, parallel end-to-end inference, while reusing pretrained speech and language models instead of requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.
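To make the layer-wise distillation idea concrete, the following is a minimal sketch, not the AuRA implementation. It assumes a mean-squared-error distance, a hand-picked student-to-teacher layer mapping, and hidden states that already share the same number of frames and the same dimensionality; the abstract specifies none of these. The names `mse`, `layerwise_distill_loss`, and `layer_map` are hypothetical.

```python
# Hypothetical sketch of a layer-wise distillation loss (assumptions: MSE
# distance, matching shapes, hand-picked layer mapping). Not AuRA's code.

def mse(a, b):
    """Mean squared error between two equally shaped sequences of vectors."""
    total, count = 0.0, 0
    for vec_a, vec_b in zip(a, b):
        for x, y in zip(vec_a, vec_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def layerwise_distill_loss(student_states, teacher_states, layer_map):
    """Average the MSE over (student_layer, teacher_layer) index pairs."""
    losses = [mse(student_states[s], teacher_states[t]) for s, t in layer_map]
    return sum(losses) / len(losses)


# Toy data: 2 teacher layers and 4 student layers, each 2 frames x 2 dims.
teacher = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
student = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],  # matches teacher layer 0 exactly
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 4.0]],  # differs from teacher layer 1 in one entry
]

# Assumed mapping: student layer 1 -> teacher layer 0, student 3 -> teacher 1.
loss = layerwise_distill_loss(student, teacher, [(1, 0), (3, 1)])
print(loss)  # 0.5
```

In a real system the student and teacher would usually differ in hidden size and frame rate, so a learned projection and some form of temporal alignment would likely be needed before the distance is computed; the sketch omits both.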