Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that carry similar meaning but are expressed differently across modalities), and effectively leveraging unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In experiments fusing the speech and text modalities, we show that the proposed architecture performs competitively in scenarios with limited aligned text-speech data. We also showcase the model's flexibility to selectively preserve unimodal generation performance (e.g., text-to-text generation) by freezing the corresponding modality tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR), where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech (TTS) generation, where the output modality is speech, we show that using a pre-trained speech backbone yields superior performance to the baseline.
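The fusion mechanism at the heart of the architecture, cross-attention that lets one pre-trained tower's activations attend to the other's, can be sketched minimally as follows. This is an illustrative single-head toy, not the paper's actual implementation; the variable names (`speech_hidden`, `text_hidden`) and projection setup are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    """Single-head cross-attention: activations from one decoder tower
    (queries) attend to activations from the other tower (keys_values)."""
    q = queries @ wq                       # (T_q, d)
    k = keys_values @ wk                   # (T_kv, d)
    v = keys_values @ wv                   # (T_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v    # (T_q, d)

rng = np.random.default_rng(0)
d = 8
# Hypothetical hidden states from two independently pre-trained towers.
speech_hidden = rng.normal(size=(5, d))    # 5 speech frames
text_hidden = rng.normal(size=(3, d))      # 3 text tokens
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Text tower queries attend over speech tower activations (e.g., for ASR,
# where the output modality is text); only the projections need training.
fused = cross_attention(text_hidden, speech_hidden, wq, wk, wv)
print(fused.shape)  # (3, 8)
```

Because only the cross-attention projections are newly introduced, either backbone can stay frozen, which is what allows the unimodal capabilities of a tower to be preserved.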