Recent advancements in large language models (LLMs) have led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method that bridges the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we use cross-modal retrieval to identify the layers best suited for alignment and perform joint training on those layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
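To make the OT-based discrepancy measure concrete, the sketch below computes an entropy-regularized optimal transport (Sinkhorn) distance between two sets of hidden states, e.g. speech frames and text tokens from one LLM layer. This is a generic, minimal illustration of quantifying representation discrepancy with OT, not the paper's exact loss; the cost function, regularization value, and uniform marginals are assumptions.

```python
import numpy as np

def sinkhorn_ot_distance(X, Y, reg=0.05, n_iters=200):
    """Entropy-regularized OT distance between two point clouds
    (e.g. speech and text hidden states from one LLM layer).
    Generic Sinkhorn sketch; NOT the paper's exact formulation."""
    # Pairwise squared-Euclidean cost matrix between the two modalities.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    n, m = C.shape
    a = np.full(n, 1.0 / n)  # uniform mass over speech frames (assumed)
    b = np.full(m, 1.0 / m)  # uniform mass over text tokens (assumed)
    # Normalize the cost before exponentiating to avoid underflow.
    K = np.exp(-(C / C.max()) / reg)
    u = np.ones(n)
    for _ in range(n_iters):  # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan coupling the two sets
    return float((P * C).sum())     # expected transport cost

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))                        # 8 "text token" vectors
speech_close = text + 0.01 * rng.normal(size=(8, 16))  # well-aligned speech
speech_far = rng.normal(loc=3.0, size=(12, 16))        # misaligned speech
d_close = sinkhorn_ot_distance(speech_close, text)
d_far = sinkhorn_ot_distance(speech_far, text)
```

Aligned representations transport cheaply, so `d_close` is far smaller than `d_far`; in a training setup this distance can serve as an auxiliary alignment loss at the chosen layers.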
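The layer-selection step can likewise be sketched as a cross-modal retrieval probe: for each layer, score how well speech embeddings retrieve their paired text embeddings by cosine similarity, then pick the layer with the highest retrieval accuracy. This is an assumed, minimal criterion for illustration; the paper's exact retrieval setup may differ.

```python
import numpy as np

def layer_retrieval_accuracy(speech_layers, text_layers):
    """For each layer, the fraction of speech embeddings whose cosine
    nearest neighbor among text embeddings is their true pair.
    Hedged sketch of retrieval-based layer selection."""
    accs = []
    for S, T in zip(speech_layers, text_layers):
        # L2-normalize so the dot product equals cosine similarity.
        S = S / np.linalg.norm(S, axis=1, keepdims=True)
        T = T / np.linalg.norm(T, axis=1, keepdims=True)
        sim = S @ T.T  # (n_pairs, n_pairs) speech-to-text similarities
        hits = (sim.argmax(axis=1) == np.arange(len(S))).mean()
        accs.append(float(hits))
    return accs

rng = np.random.default_rng(1)
n_pairs, dim = 32, 24
text_layers = [rng.normal(size=(n_pairs, dim)) for _ in range(3)]
# Toy data: layer 1 is well aligned (speech ~= text + small noise),
# layers 0 and 2 are unrelated to their text counterparts.
speech_layers = [rng.normal(size=(n_pairs, dim)),
                 text_layers[1] + 0.05 * rng.normal(size=(n_pairs, dim)),
                 rng.normal(size=(n_pairs, dim))]
accs = layer_retrieval_accuracy(speech_layers, text_layers)
best_layer = int(np.argmax(accs))
```

Here `best_layer` identifies the aligned layer (index 1 in this toy setup); in AI-STA, such a probe would flag which inner layers to target with the alignment objective.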