Humans can easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), an ability that current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves a new state of the art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.