Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech -- the most common form of communication. To integrate speech into LLMs, one promising approach is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with the speech encoder. However, DFP typically requires connecting a text decoder to a speech encoder. This raises questions about the importance of having a sophisticated speech encoder for DFP, and how its performance compares with a standard encoder-decoder (i.e., cross-attention) architecture. In order to perform a controlled architectural comparison, we train all models from scratch, rather than using large pretrained models, and use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on the MuST-C v1.0 and CoVoST2 datasets. We study the influence of the speech encoder in DFP. More importantly, we compare DFP and cross-attention under a variety of configurations, such as CTC compression and sequence-level knowledge distillation, and evaluate generation speed and GPU memory footprint on monolingual, bilingual and multilingual models. Despite the prevalence of DFP over cross-attention, our overall results do not indicate a clear advantage of DFP.
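As a rough illustration of the DFP idea described above (a minimal sketch, not the paper's implementation), the following PyTorch snippet prepends projected speech representations to the text embeddings before decoding the concatenated sequence; the module and variable names (speech_encoder, projector, text_decoder) are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class DFPModel(nn.Module):
    """Sketch of dense feature prepending (DFP): speech encoder -> projection -> text decoder."""

    def __init__(self, speech_encoder: nn.Module, text_decoder: nn.Module,
                 speech_dim: int, text_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder               # maps audio features to speech representations
        self.projector = nn.Linear(speech_dim, text_dim)   # projects speech reps into the text embedding space
        self.text_decoder = text_decoder                   # assumed to accept a sequence of embeddings directly

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        speech_reps = self.speech_encoder(audio_feats)      # (B, T_speech, speech_dim)
        projected = self.projector(speech_reps)             # (B, T_speech, text_dim)
        # DFP: prepend the projected speech representations to the textual
        # representations, then decode end-to-end through the text decoder.
        decoder_input = torch.cat([projected, text_embeds], dim=1)  # (B, T_speech + T_text, text_dim)
        return self.text_decoder(decoder_input)
```

By contrast, a cross-attention (encoder-decoder) model would keep the speech representations outside the decoder's input sequence and attend to them through cross-attention layers instead of concatenation.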