We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel at transfer learning, they struggle to perform a diverse range of tasks from simple instructions, a capability that requires handling varying levels of spatial hierarchy and semantic granularity. Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, a dataset of 5.4 billion comprehensive visual annotations on 126 million images, built using an iterative strategy of automated image annotation and model refinement. We adopt a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrate that Florence-2 is a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
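To make the idea of producing detection results "in text form" concrete, the following minimal sketch shows one way a set of bounding boxes could be serialized into a single target string and parsed back. The abstract does not specify the output format, so everything here is an illustrative assumption: the `<loc_N>` token syntax, the choice of 1000 coordinate bins, and the helper names `quantize`, `boxes_to_text`, and `text_to_boxes` are all hypothetical and are not taken from the Florence-2 implementation.

```python
import re

# Assumption: 1000 coordinate bins; the abstract does not state a bin count.
NUM_BINS = 1000


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    return min(max(bin_index, 0), num_bins - 1)


def boxes_to_text(boxes, width, height):
    """Serialize (label, x0, y0, x1, y1) boxes into one string.

    The "<loc_N>" token syntax is a hypothetical format chosen for illustration.
    """
    parts = []
    for label, x0, y0, x1, y1 in boxes:
        bins = (
            quantize(x0, width),
            quantize(y0, height),
            quantize(x1, width),
            quantize(y1, height),
        )
        parts.append(label + "".join(f"<loc_{b}>" for b in bins))
    return "".join(parts)


# One label followed by exactly four location tokens.
_BOX_PATTERN = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def text_to_boxes(text, width, height, num_bins=NUM_BINS):
    """Parse the string back into boxes, using bin centers as pixel coordinates."""
    boxes = []
    for label, *bins in _BOX_PATTERN.findall(text):
        bx0, by0, bx1, by1 = (int(b) for b in bins)
        boxes.append((
            label,
            (bx0 + 0.5) * width / num_bins,
            (by0 + 0.5) * height / num_bins,
            (bx1 + 0.5) * width / num_bins,
            (by1 + 0.5) * height / num_bins,
        ))
    return boxes


target = boxes_to_text([("cat", 64, 48, 320, 240)], width=640, height=480)
print(target)  # cat<loc_100><loc_100><loc_500><loc_500>
```

Under this assumed encoding, a sequence-to-sequence model can be trained on detection with the same text-generation objective it uses for captioning, since both targets are plain token sequences. The round trip is lossy: recovered coordinates are accurate only to within one bin width.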