By leveraging the power of Large Language Models (LLMs) and speech foundation models, state-of-the-art speech-text bimodal systems can accomplish challenging tasks such as speech translation (ST) and spoken question answering (SQA) with much simpler architectures. In this paper, we utilize the capabilities of the Whisper encoder and the pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with a single-layer module and hundreds of hours of speech-text multitask corpus. We further swap Yi-6B for Yi-6B-Chat, its human-preference-aligned version, during inference, and find that the alignment capability transfers as well. In addition, singular value decomposition (SVD) of the alignment module suggests that the linear alignment subspace is sparse, which leaves the possibility of concatenating other features, such as voice-print or video, to expand modality.
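The sparsity claim about the linear alignment subspace can be probed by inspecting the singular-value spectrum of the adapter's weight matrix. The sketch below is a minimal illustration, not the paper's actual analysis: the matrix shapes (Whisper encoder dimension 1280, Yi-6B hidden dimension 4096) and the 95% energy threshold are assumptions, and the random weights stand in for a trained module.

```python
import numpy as np

# Hedged sketch: examine the singular-value spectrum of a (hypothetical)
# single-layer alignment module mapping speech features to LLM hidden states.
# Shapes are assumptions: Whisper encoder dim 1280 -> Yi-6B hidden dim 4096.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 1280)).astype(np.float32)  # stand-in weights

# Singular values of the projection. A sparse alignment subspace would show
# most of the energy concentrated in a few leading singular values, leaving
# unused directions that could host extra features (e.g. voice-print).
s = np.linalg.svd(W, compute_uv=False)

# Effective rank: smallest k whose leading singular values capture 95% of
# the total energy (sum of squared singular values).
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.95)) + 1
print(f"effective rank {k} of {len(s)} singular values")
```

For a trained adapter, a small effective rank relative to the full dimension would indicate the sparse linear alignment subspace the abstract describes; the random matrix here will instead show a nearly full-rank spectrum, which serves as the contrast case.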