We present the Recognize Anything Model (RAM), a strong foundation model for image tagging. RAM takes a substantial step forward for large models in computer vision, demonstrating a zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging that trains on large-scale image-text pairs instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained on the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe strong zero-shot performance that significantly outperforms CLIP and BLIP. Remarkably, RAM even surpasses fully supervised methods and is competitive with the Google tagging API. We release RAM at \url{https://recognize-anything.github.io/} to foster the advancement of large models in computer vision.
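To make the first step concrete, the following is a minimal sketch of obtaining annotation-free tags from captions. It is not the parser used for RAM: the abstract only states that tags come from automatic text semantic parsing, so this sketch swaps in plain vocabulary matching. The tag vocabulary, the surface-form map, and the names `parse_tags` and `tag_frequencies` are all hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical tag vocabulary; the actual label system is not given in the abstract.
TAG_VOCABULARY = {"dog", "frisbee", "grass", "park", "run", "catch"}

# Hypothetical map from surface forms to canonical tags,
# standing in for the lemmatization a real semantic parser would perform.
SURFACE_TO_TAG = {
    "dogs": "dog",
    "runs": "run",
    "running": "run",
    "catches": "catch",
    "catching": "catch",
}


def parse_tags(caption):
    """Return the vocabulary tags mentioned in a caption, in order of first appearance."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    tags = []
    for token in tokens:
        tag = SURFACE_TO_TAG.get(token, token)
        if tag in TAG_VOCABULARY and tag not in tags:
            tags.append(tag)
    return tags


def tag_frequencies(captions):
    """Count how many captions mention each tag across a collection."""
    counts = Counter()
    for caption in captions:
        counts.update(parse_tags(caption))
    return counts
```

Under these assumptions, `parse_tags("A dog catches a frisbee in the park.")` yields the tags `dog`, `catch`, `frisbee`, and `park`, which could then serve as weak tagging supervision alongside the original caption text.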