4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels from specialist models like SAM and 4DHumans, and a range of new modalities, such as image metadata or color palettes, that allow for novel ways to interact with the model and steer the generation. A crucial step in this process is performing discrete tokenization on the various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show that one model can be trained to solve at least 3x more tasks/modalities than existing ones without a loss in performance. This enables more fine-grained and controllable multimodal generation and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three-billion-parameter model using tens of modalities and different datasets. The resulting models and training code are open-sourced at 4m.epfl.ch.
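To make the idea of discrete tokenization concrete, the following is a minimal sketch of nearest-codebook vector quantization, one common way to turn continuous feature vectors into integer tokens. It is a generic illustration, not the tokenizers used in the paper: the function names (`tokenize`, `detokenize`), the tiny hand-written codebook, and the sample features are all assumptions made for this example. In practice a codebook would be learned rather than fixed.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in features:
        distances = [squared_distance(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def detokenize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Recover an approximation of the features by looking up each token."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 4-entry codebook over 2-D features (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical continuous features, e.g. three patches of a feature map.
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

tokens = tokenize(features, codebook)
reconstruction = detokenize(tokens, codebook)
print(tokens)
print(reconstruction)
```

Once every modality is reduced to a sequence of integer tokens like this, a single sequence model can in principle treat all of them uniformly as inputs and prediction targets, which is the property the abstract relies on.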

