LLM-based automatic speech recognition (ASR) models achieve strong performance by connecting audio encoders to LLMs. However, the scarcity of paired speech and transcription data often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically either fine-tune the LLM alone or employ pseudo-audio prompts. The former neglects essential acoustic context, while the latter either scales poorly in data-scarce conditions or yields inexpressive prompts by relying only on textual features and ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.