PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.