Jointly processing multi-modal information is becoming an essential capability. However, the scarcity of paired multi-modal data and the large computational requirements of multi-modal learning hinder its development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint in which different modalities are interpreted as different languages and multi-modal translation is treated as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is confined to the tokenization and detokenization stages. We evaluate TMT on all six modality translation tasks. TMT consistently outperforms its single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
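The "unified interface" that discrete tokens provide can be illustrated with a small sketch. It is not the TMT implementation: the vocabulary sizes, the target-modality tag tokens, and the helper names (`to_unified`, `from_unified`, `build_source`) are all assumptions made for illustration. The idea shown is only that, once each modality is tokenized, its token IDs can be shifted into disjoint ranges of one shared vocabulary, so a single encoder-decoder can consume and emit sequences from any modality.

```python
# Hypothetical per-modality vocabulary sizes (not taken from the paper).
VOCAB_SIZES = {"text": 1000, "speech": 200, "image": 8192}
MODALITIES = list(VOCAB_SIZES)

# Assumed convention: one tag token per modality, used to request a target
# modality. The tags occupy the first IDs of the shared vocabulary.
TAG_IDS = {m: i for i, m in enumerate(MODALITIES)}

# Give each modality a disjoint ID range after the tag tokens.
OFFSETS = {}
_next_id = len(MODALITIES)
for _m in MODALITIES:
    OFFSETS[_m] = _next_id
    _next_id += VOCAB_SIZES[_m]
UNIFIED_VOCAB_SIZE = _next_id


def to_unified(modality, local_ids):
    """Map modality-local token IDs into the shared ID space."""
    size = VOCAB_SIZES[modality]
    for t in local_ids:
        if not 0 <= t < size:
            raise ValueError(f"token {t} out of range for {modality}")
    return [OFFSETS[modality] + t for t in local_ids]


def from_unified(unified_ids):
    """Recover (modality, local IDs) from IDs that share one modality."""
    if not unified_ids:
        raise ValueError("empty sequence")
    for m in MODALITIES:
        lo, hi = OFFSETS[m], OFFSETS[m] + VOCAB_SIZES[m]
        if all(lo <= t < hi for t in unified_ids):
            return m, [t - lo for t in unified_ids]
    raise ValueError("IDs do not belong to a single modality")


def build_source(src_modality, local_ids, tgt_modality):
    """Prefix the source tokens with the tag of the desired target modality."""
    return [TAG_IDS[tgt_modality]] + to_unified(src_modality, local_ids)


if __name__ == "__main__":
    # Speech tokens [0, 5] to be translated into text.
    print(build_source("speech", [0, 5], "text"))
    print(from_unified(to_unified("image", [7, 42])))
```

Under this layout, modality-specific work (the tokenizers and detokenizers that produce and consume the local IDs) stays outside the translation model, which only ever sees integers from one shared vocabulary.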