Large language models (LLMs) exhibit remarkable performance across diverse tasks, indicating their potential for expansion into large speech-text models (LSMs) by integrating speech capabilities. Although unified speech-text pre-training and instruction tuning on multimodal data offer considerable benefits, these methods generally entail significant resource demands and tend to overfit specific tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We explore the instruction-following dynamics within LSMs and identify a critical issue termed speech anchor bias: a tendency for LSMs to over-rely on speech inputs, mistakenly interpreting the entire speech modality as directives and thereby neglecting textual instructions. To counteract this bias, we introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning. Our experiments across a range of speech-based tasks demonstrate that the self-powered LSM mitigates speech anchor bias and improves the fusion of the speech and text modalities in LSMs. Data, code, and scripts are freely available at https://github.com/ytf-philp/Self-powered-LSM.