We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. From AudioLM, AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation; from text large language models such as PaLM-2, it inherits linguistic knowledge present only in such models. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems on speech translation tasks and can perform zero-shot speech-to-text translation for many languages whose input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
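The abstract says AudioPaLM is initialized from the weights of a text-only model but does not say how speech is added to it. As a rough illustration only, one plausible way to reuse pretrained text weights is to keep the text embedding rows as they are and append new rows for discrete audio tokens. The sketch below is a toy under that assumption: the function names `extend_embeddings` and `audio_token_id`, the table sizes, and the random initialization of the new rows are all hypothetical and are not taken from the abstract.

```python
import random


def extend_embeddings(text_embeddings, num_audio_tokens, seed=0, scale=0.02):
    """Return a new table: pretrained text rows first, then fresh audio rows.

    Hypothetical helper; the initialization of the new rows is an assumption.
    """
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    # Copy the pretrained rows so the original table is left untouched.
    extended = [list(row) for row in text_embeddings]
    # Append one new row per audio token, filled with small random values.
    for _ in range(num_audio_tokens):
        extended.append([rng.gauss(0.0, scale) for _ in range(dim)])
    return extended


def audio_token_id(audio_index, text_vocab_size):
    """Map an audio token index to its row in the extended table."""
    return text_vocab_size + audio_index


# Toy example: 4 text tokens, embedding width 3, 2 new audio tokens.
text_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
table = extend_embeddings(text_table, num_audio_tokens=2)
```

In this toy setup, text token ids keep their original rows, so whatever the text model learned in pretraining is preserved, while audio tokens occupy ids starting at the old vocabulary size.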