Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of their parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that carry similar meaning but are expressed differently across modalities) and the effective use of unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing the speech and text modalities, we show that the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal generation performance (e.g., text-to-text generation) by freezing the corresponding modality tower (e.g., text). In cross-modal tasks where the output modality is text, such as automatic speech recognition (ASR), we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks where the output modality is speech, such as text-to-speech generation (TTS), we show that using a pre-trained speech backbone yields better performance than the baseline.
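The composition idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensions, the parameter names (`text_tower`, `cross_q`, `cross_k`, `cross_v`), the single-head attention, the all-ones placeholder gradients, and the omission of causal masking are all assumptions made for illustration. The sketch shows hidden states from one tower (text) attending over hidden states from another tower (speech) through a cross-attention layer, and a toy update step that leaves the frozen tower's weights untouched.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attend(queries, memory, Wq, Wk, Wv):
    """Single-head cross-attention: each query-tower state attends over all
    memory-tower states, and the result is added back as a residual."""
    keys = [matvec(Wk, m) for m in memory]
    values = [matvec(Wv, m) for m in memory]
    fused = []
    for h in queries:
        q = matvec(Wq, h)
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        ctx = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        fused.append([hi + ci for hi, ci in zip(h, ctx)])
    return fused


def sgd_step(params, grads, frozen, lr=0.1):
    """Toy gradient step that skips every parameter group listed in `frozen`."""
    updated = {}
    for name, W in params.items():
        if name in frozen:
            updated[name] = W  # frozen tower: weights pass through unchanged
        else:
            updated[name] = [[w - lr * g for w, g in zip(w_row, g_row)]
                             for w_row, g_row in zip(W, grads[name])]
    return updated


rng = random.Random(0)
D_TEXT, D_SPEECH, D_ATTN = 4, 6, 3  # hypothetical hidden sizes

# Stand-ins for hidden states produced by the two pre-trained towers.
text_hidden = [[rng.gauss(0, 1) for _ in range(D_TEXT)] for _ in range(2)]
speech_hidden = [[rng.gauss(0, 1) for _ in range(D_SPEECH)] for _ in range(5)]

params = {
    "text_tower": random_matrix(D_TEXT, D_TEXT, rng),  # stand-in for backbone weights
    "cross_q": random_matrix(D_ATTN, D_TEXT, rng),
    "cross_k": random_matrix(D_ATTN, D_SPEECH, rng),
    "cross_v": random_matrix(D_TEXT, D_SPEECH, rng),
}
frozen = {"text_tower"}

fused = cross_attend(text_hidden, speech_hidden,
                     params["cross_q"], params["cross_k"], params["cross_v"])

# Placeholder gradients (all ones), only to show which groups get updated.
grads = {name: [[1.0] * len(W[0]) for _ in W] for name, W in params.items()}
new_params = sgd_step(params, grads, frozen)
```

In this sketch the fused states keep the text tower's dimensionality, so they could be passed on to the next text-tower layer, and only the cross-attention projections change during the update, which mirrors the idea of preserving unimodal behavior by freezing a tower.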