The Evolution of Multimodal Model Architectures

This work identifies and characterizes four prevalent architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type makes developments in the multimodal domain easier to track. Unlike recent survey papers that present general information on multimodal architectures, this work examines architectural details in depth and uses them to define the four types. The types are distinguished by how they integrate multimodal inputs into the deep neural network model. The first two types (Type-A and Type-B) deeply fuse multimodal inputs within the internal layers of the model, whereas the other two (Type-C and Type-D) perform early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B uses custom-designed layers for modality fusion within the internal layers. Type-C, in contrast, uses modality-specific encoders, while Type-D uses tokenizers to process the modalities at the model's input stage. The identified architecture types also help track the development of any-to-any multimodal models. Notably, Type-C and Type-D are currently favored for building any-to-any multimodal models. Type-C, distinguished by its non-tokenizing architecture, is emerging as a viable alternative to Type-D, which relies on input tokenization. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type in terms of data and compute requirements, architecture complexity, scalability, ease of adding modalities, training objectives, and any-to-any multimodal generation capability.
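To make the distinction between deep fusion and early fusion concrete, the following is a minimal sketch in plain Python. It is not taken from the paper: the function names, the single-head attention without learned projections, the nearest-codebook tokenizer, and the shared-vocabulary offset are all illustrative assumptions. Type-B is omitted because it would differ from the Type-A sketch only by swapping the cross-attention step for a custom-designed fusion layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_states):
    """Type-A style deep fusion (hypothetical simplification).

    Each text hidden state attends over the image states inside the model.
    Single head, no learned projections, residual connection added.
    """
    dim = len(text_states[0])
    fused = []
    for query in text_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_states
        ]
        weights = softmax(scores)
        context = [
            sum(w * value[i] for w, value in zip(weights, image_states))
            for i in range(dim)
        ]
        fused.append([q + c for q, c in zip(query, context)])
    return fused


def early_fusion_with_encoder(text_embeddings, image_features, projection):
    """Type-C style early fusion (hypothetical simplification).

    Features from a modality-specific encoder are linearly projected to the
    model's embedding width and concatenated with the text embeddings.
    `projection` is a list of weight columns, one per output dimension.
    Placing image embeddings before text is an arbitrary choice here.
    """
    projected = [
        [sum(f * w for f, w in zip(feature, column)) for column in projection]
        for feature in image_features
    ]
    return projected + text_embeddings


def early_fusion_with_tokenizer(text_token_ids, image_patches, codebook,
                                text_vocab_size):
    """Type-D style early fusion (hypothetical simplification).

    Each image patch is mapped to the index of its nearest codebook entry,
    then offset past the text vocabulary so both modalities share one
    discrete token space (the offset scheme is an assumption).
    """
    def nearest(patch):
        return min(
            range(len(codebook)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(patch, codebook[i])),
        )

    image_token_ids = [text_vocab_size + nearest(p) for p in image_patches]
    return image_token_ids + text_token_ids
```

In this sketch the Type-A function returns one fused state per text position, so the sequence length seen by the model is unchanged and the image enters only through attention. The Type-C and Type-D functions instead lengthen the input sequence, with continuous embeddings in the first case and discrete token ids in the second.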
