Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, designing a unified network that processes various modalities ($\textit{e.g.}$ natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) remains challenging because of the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a $\textbf{frozen}$ encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Meta-Transformer is composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks. It is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks show that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer
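The three-component layout described in the abstract (tokenizer, frozen shared encoder, task-specific heads) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `Tokenizer`, `FrozenEncoder`, and `Head`, the width `DIM`, the chunk sizes, and the single linear-plus-ReLU layer standing in for the encoder are all assumptions made for illustration, and the weights are random rather than pretrained.

```python
import random

DIM = 8  # width of the shared token space (illustrative value, not from the paper)


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(mat, vec):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


class Tokenizer:
    """Hypothetical modality-specific tokenizer.

    Splits a flat input into fixed-size chunks and projects each chunk
    into the shared DIM-dimensional token space.
    """

    def __init__(self, chunk, rng):
        self.chunk = chunk
        self.proj = make_matrix(DIM, chunk, rng)

    def __call__(self, raw):
        pad = (-len(raw)) % self.chunk  # zero-pad to a whole number of chunks
        raw = list(raw) + [0.0] * pad
        return [matvec(self.proj, raw[i:i + self.chunk])
                for i in range(0, len(raw), self.chunk)]


class FrozenEncoder:
    """Stand-in for the modality-shared encoder (assumption: one linear
    layer with ReLU and mean pooling, not a real transformer).

    Weights are stored as tuples so they cannot be modified after creation.
    """

    def __init__(self, rng):
        self._weights = tuple(tuple(row) for row in make_matrix(DIM, DIM, rng))

    def __call__(self, tokens):
        mixed = [[max(0.0, v) for v in matvec(self._weights, t)] for t in tokens]
        return [sum(t[j] for t in mixed) / len(mixed) for j in range(DIM)]


class Head:
    """Hypothetical task-specific head: a linear map from features to outputs."""

    def __init__(self, n_out, rng):
        self.weights = make_matrix(n_out, DIM, rng)

    def __call__(self, features):
        return matvec(self.weights, features)


rng = random.Random(0)
tokenizers = {"text": Tokenizer(4, rng), "audio": Tokenizer(16, rng)}
encoder = FrozenEncoder(rng)  # one encoder shared by every modality
heads = {"text": Head(2, rng), "audio": Head(5, rng)}


def forward(modality, raw):
    """Tokenize the raw input, encode it with the shared encoder, apply the head."""
    tokens = tokenizers[modality](raw)
    features = encoder(tokens)
    return heads[modality](features)


text_out = forward("text", [0.1 * i for i in range(10)])
audio_out = forward("audio", [0.01 * i for i in range(40)])
```

In this sketch only the tokenizers and heads differ per modality, while every input passes through the same encoder object. Each modality is processed on its own data, so no paired multimodal samples are involved.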