We present a novel Speech Augmented Language Model (SALM) with {\em multitask} and {\em in-context} learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, we propose {\em speech supervised in-context training} to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
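To make the composition described in the abstract concrete, the following is a minimal pure-Python sketch, not the released NeMo implementation. It illustrates two ideas only: a LoRA layer, which adds a trainable low-rank update to a frozen weight, and a modality adapter that maps audio-encoder frames into the LLM embedding space before they are combined with the instruction embeddings. All function names, the toy dimensions, the single-linear-map adapter, and the choice to place the instruction before the speech frames are assumptions made for illustration.

```python
# Hypothetical sketch: names, shapes, and ordering are illustrative assumptions,
# not the SALM or NeMo API.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_linear(W_frozen, A, B, x, alpha=1.0):
    """Frozen weight plus low-rank update: W x + (alpha / r) * B (A x).

    W_frozen stays fixed; only A (r x d_in) and B (d_out x r) would be trained.
    """
    r = len(A)  # LoRA rank
    base = matvec(W_frozen, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def adapt(speech_frames, W_adapter):
    """Assumed adapter: one linear map from audio-encoder dim to LLM embedding dim."""
    return [matvec(W_adapter, frame) for frame in speech_frames]

def build_llm_input(instruction_embeddings, speech_frames, W_adapter):
    """Concatenate instruction embeddings and adapted speech frames along time.

    The instruction-first ordering is an assumption of this sketch.
    """
    return instruction_embeddings + adapt(speech_frames, W_adapter)

# Toy usage with a 2-dimensional "LLM" and 3-dimensional audio features.
W = [[1, 0], [0, 1]]            # frozen LLM weight
A = [[1, 1]]                    # rank-1 down-projection
B = [[0.5], [0.0]]              # rank-1 up-projection
x = [2, 3]
y = lora_linear(W, A, B, x)

W_adapter = [[1, 0, 0], [0, 1, 0]]
frames = [[1, 2, 3], [4, 5, 6]]
instruction = [[0.1, 0.2]]
llm_input = build_llm_input(instruction, frames, W_adapter)
```

When `B` is all zeros, which is the usual LoRA initialization, `lora_linear` returns exactly the frozen layer's output, so training starts from the behavior of the unmodified text LLM.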