Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text from visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks in many languages. To train PaLI, we start from large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This lets us capitalize on their existing capabilities and reuse the substantial investment already made in training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion-parameter ViT (ViT-e) to quantify the benefits of even larger-capacity vision models. For pre-training, we create a large multilingual mix of tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.
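The "image and text in, text out" interface described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` type, the helper names, and the prompt wording are hypothetical and are not taken from the PaLI implementation. The sketch only shows how captioning, visual question answering, and a text-only task in different languages could share one example format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Example:
    """One example in a shared image+text -> text format (hypothetical type)."""
    image_id: Optional[str]  # None for text-only tasks
    input_text: str          # prompt that tells the model which task to perform
    target_text: str         # text the model is expected to generate


def captioning(image_id: str, lang: str, caption: str) -> Example:
    # Hypothetical prompt template; the wording is illustrative only.
    return Example(image_id, f"Generate a caption in {lang}.", caption)


def vqa(image_id: str, lang: str, question: str, answer: str) -> Example:
    # Same output type as captioning: only the prompt text differs.
    return Example(image_id, f"Answer in {lang}: {question}", answer)


def text_only(lang: str, source: str, target: str) -> Example:
    # Language-only tasks fit the same format by omitting the image.
    return Example(None, f"Continue in {lang}: {source}", target)


examples = [
    captioning("img_001", "en", "A dog on a beach."),
    vqa("img_001", "fr", "Quel animal est-ce ?", "un chien"),
    text_only("de", "Guten", "Morgen"),
]

for ex in examples:
    print(ex.image_id, "|", ex.input_text, "->", ex.target_text)
```

Because every task reduces to the same triple of optional image, input text, and target text, a single text-generating model can in principle be trained on a mixture of such examples without task-specific output heads.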