The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is something current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities and even shows emergent ability to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art results on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.