In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality feed-forward networks (FFNs). This design makes it easy to extend the model to new modalities by adding adapters and FFNs, while the shared self-attention layers enable multi-modal fusion. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic spaces of different modalities and concurrently capture fine-grained details within each modality. With its scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any pretrained vision or language model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
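To make the idea behind cross-modal aligning contrast more concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired embeddings from two modalities. It is an illustration under stated assumptions, not the ONE-PEACE implementation: the abstract does not give the exact loss, and the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical. The sketch uses only the Python standard library and assumes every embedding is a non-zero vector.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Illustrative symmetric contrastive loss (hypothetical helper).

    emb_a[i] and emb_b[i] are assumed to be embeddings of the same
    underlying sample in two modalities (e.g. an image and its caption).
    Matching pairs are pulled together; all other pairs in the batch
    act as negatives.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)

    # Cosine similarity between every a[i] and b[j], scaled by temperature.
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Direction a -> b: row i should select column i.
    loss_a_to_b = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Direction b -> a: column j should select row j.
    loss_b_to_a = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n

    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Toy data: two "image" embeddings with aligned and swapped "text" embeddings.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_emb_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(image_emb, text_emb_aligned)
loss_swapped = symmetric_contrastive_loss(image_emb, text_emb_swapped)
```

On this toy data, the loss for the aligned pairs is much smaller than the loss for the swapped pairs, which is the behaviour a cross-modal alignment objective is meant to reward. Because the loss only consumes embedding vectors, the same function applies unchanged to any pair of modalities, which is one way to read "modality-agnostic" in the abstract.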